The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
Kaggle is a data science competition platform that also hosts many datasets. In the past, submitting a result was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle from the command line, e.g.:
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
kaggle.json file
kaggle.json in the right place
For more detailed information on setting up the Kaggle API, see here and here.
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.64.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (6.1.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2021.10.8)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)
!pwd
# /root/shared/PycharmProjects/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/WIP_phase-1/kaggle.json
/content
!ls
sample_data
!rm -rf ~/.kaggle
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions files home-credit-default-risk
| name | size | creationDate |
|---|---|---|
| application_test.csv | 25MB | 2019-12-11 02:55:35 |
| sample_submission.csv | 524KB | 2019-12-11 02:55:35 |
| credit_card_balance.csv | 405MB | 2019-12-11 02:55:35 |
| previous_application.csv | 386MB | 2019-12-11 02:55:35 |
| HomeCredit_columns_description.csv | 37KB | 2019-12-11 02:55:35 |
| bureau.csv | 162MB | 2019-12-11 02:55:35 |
| bureau_balance.csv | 358MB | 2019-12-11 02:55:35 |
| application_train.csv | 158MB | 2019-12-11 02:55:35 |
| installments_payments.csv | 690MB | 2019-12-11 02:55:35 |
| POS_CASH_balance.csv | 375MB | 2019-12-11 02:55:35 |
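The credential setup above (`mkdir`, `cp`, `chmod 600`) can also be sketched in Python. This is a minimal sketch under stated assumptions: `install_kaggle_token` and its parameters are hypothetical names chosen for illustration and are not part of the Kaggle API. Unlike the shell cell, it does not delete an existing `~/.kaggle` folder first.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper for illustration only; it mirrors the shell
    commands used in this notebook.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p ~/.kaggle`
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)             # like `cp kaggle.json ~/.kaggle/`
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like `chmod 600`
    return target
```

Calling `install_kaggle_token("kaggle.json")` from the folder that holds the token would place it where the `kaggle` command line tool looks for it by default.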
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
There are 7 different sources of data:
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Use the Download button on the following Data Webpage and unzip the zip file into the base directory (DATA_DIR).
# Commented by kiran
# DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
# DATA_DIR = os.path.join('./ddddd/')
# !mkdir $DATA_DIR
# data dir for kiran
DATA_DIR = "../content/"
# Google collab dir: Account: kikarand@iu.edu
# DATA_DIR="gdrive/MyDrive/data/"
!ls -l $DATA_DIR
total 8
-rw-r--r-- 1 root root 65 Apr 20 15:28 kaggle.json
drwxr-xr-x 1 root root 4096 Apr 8 13:32 sample_data
# Added to download files in Google Colab
from google.colab import drive,files
drive.mount('/content/gdrive')
#this will prompt you to upload the kaggle.json
# files.upload()
!ls -lha kaggle.json
Mounted at /content/gdrive
-rw-r--r-- 1 root root 65 Apr 20 15:28 kaggle.json
!rm -rf ~/.kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
DATA_DIR = "/content/"
!mkdir -p $DATA_DIR
!ls -l $DATA_DIR
total 12
drwx------ 6 root root 4096 Apr 20 15:29 gdrive
-rw-r--r-- 1 root root 65 Apr 20 15:28 kaggle.json
drwxr-xr-x 1 root root 4096 Apr 8 13:32 sample_data
# Download the competition archive (only needed if the zip file does not exist yet)
! kaggle competitions download -c home-credit-default-risk
Downloading home-credit-default-risk.zip to /content
98% 675M/688M [00:03<00:00, 186MB/s]
100% 688M/688M [00:03<00:00, 182MB/s]
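The download cell above always runs, even when the archive is already present. A minimal sketch of a guarded download follows, assuming the `kaggle` command line tool is installed and that its `-p` option selects the target folder. The names `build_download_command`, `download_if_missing` and `runner` are hypothetical and exist only for this illustration.

```python
import subprocess
from pathlib import Path

COMPETITION = "home-credit-default-risk"


def build_download_command(competition, dest_dir):
    """Return the Kaggle CLI call used in this notebook, plus -p for the target folder."""
    return ["kaggle", "competitions", "download", "-c", competition, "-p", str(dest_dir)]


def download_if_missing(competition, dest_dir, runner=subprocess.run):
    """Start the download only when <dest_dir>/<competition>.zip is absent.

    Returns True if a download was started, False if the zip already exists.
    `runner` is injectable so the logic can be tried without network access.
    """
    zip_path = Path(dest_dir) / f"{competition}.zip"
    if zip_path.exists():
        return False
    runner(build_download_command(competition, dest_dir), check=True)
    return True
```

With the paths used in this session, the call would be `download_if_missing(COMPETITION, "/content")`.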
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')
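unzippingReq = True  # set to False once the archive has been extracted
if unzippingReq:
    # Extract every file from the competition archive into the current
    # working directory (/content in this session, the same folder as DATA_DIR)
    with zipfile.ZipFile('home-credit-default-risk.zip', 'r') as zip_ref:
        zip_ref.extractall()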
# zip_ref = zipfile.ZipFile('application_test.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
# zip_ref = zipfile.ZipFile('bureau_balance.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
# zip_ref = zipfile.ZipFile('bureau.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
# zip_ref = zipfile.ZipFile('credit_card_balance.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
# zip_ref = zipfile.ZipFile('installments_payments.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
# zip_ref = zipfile.ZipFile('POS_CASH_balance.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
# zip_ref = zipfile.ZipFile('previous_application.csv.zip', 'r')
# zip_ref.extractall('datasets')
# zip_ref.close()
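Re-running the extraction cell unpacks the full archive again. The following is a minimal sketch of an extraction step that skips files already on disk; `extract_missing` is a hypothetical helper name, and the check is by file name only (it does not verify file contents).

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, dest_dir):
    """Extract only the archive members that are not yet present in dest_dir.

    Returns the list of member names that were extracted.
    """
    dest = Path(dest_dir)
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as archive:  # closed automatically
        for name in archive.namelist():
            if not (dest / name).exists():
                archive.extract(name, path=dest)
                extracted.append(name)
    return extracted
```

In this notebook the call would look like `extract_missing('home-credit-default-risk.zip', DATA_DIR)`.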
data_dict_path = os.path.join(DATA_DIR, "HomeCredit_columns_description.csv")
data_dict = pd.read_csv(data_dict_path, engine="python", encoding="utf-8" ,header=0, encoding_errors='ignore')
data_dict["Table"].unique()
array(['application_{train|test}.csv', 'bureau.csv', 'bureau_balance.csv',
'POS_CASH_balance.csv', 'credit_card_balance.csv',
'previous_application.csv', 'installments_payments.csv'],
dtype=object)
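The data dictionary can be used to look up what the columns of a given table mean. The sketch below is hedged: only the `Table` column is confirmed by the output above; the column names `Row` and `Description` are assumptions about the dictionary file, and the toy descriptions are made up for illustration. `describe_columns` is a hypothetical helper name.

```python
import pandas as pd


def describe_columns(data_dict, table, table_col="Table",
                     name_col="Row", desc_col="Description"):
    """Return a {column name: description} mapping for one table.

    `name_col` and `desc_col` are assumed column names; adjust them to
    match the actual header of the dictionary file.
    """
    rows = data_dict[data_dict[table_col] == table]
    return dict(zip(rows[name_col], rows[desc_col]))


# Toy stand-in for the real data dictionary (descriptions are invented).
toy_dict = pd.DataFrame({
    "Table": ["bureau.csv", "bureau.csv", "bureau_balance.csv"],
    "Row": ["SK_ID_CURR", "DAYS_CREDIT", "STATUS"],
    "Description": ["toy text A", "toy text B", "toy text C"],
})
bureau_cols = describe_columns(toy_dict, "bureau.csv")
```

With the real file, `describe_columns(data_dict, "bureau.csv")` would be the equivalent call, provided the assumed column names hold.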
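from IPython.display import display  # explicit import so load_data also works outside a notebook cell

def load_data(in_path, name):
    """Read a CSV file into a DataFrame and show a quick overview of it."""
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())          # info() prints its report and returns None, hence the trailing "None"
    display(df.head(5))       # first five rows
    display(df.describe())    # summary statistics of the numeric columns
    display(df.isna().sum())  # number of missing values per column
    return df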
datasets={} # lets store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
SK_ID_CURR 0
TARGET 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
...
AMT_REQ_CREDIT_BUREAU_DAY 41519
AMT_REQ_CREDIT_BUREAU_WEEK 41519
AMT_REQ_CREDIT_BUREAU_MON 41519
AMT_REQ_CREDIT_BUREAU_QRT 41519
AMT_REQ_CREDIT_BUREAU_YEAR 41519
Length: 122, dtype: int64
(307511, 122)
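The summary above reports a TARGET mean of about 0.08 with a minimum of 0 and a maximum of 1, which suggests that the positive class is rare. A minimal sketch of how the class balance could be checked follows; `target_balance` is a hypothetical helper name and the toy frame is invented for illustration.

```python
import pandas as pd


def target_balance(df, target_col="TARGET"):
    """Return the count and share of each class in the target column."""
    counts = df[target_col].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy data: nine rows of class 0 and one row of class 1.
toy = pd.DataFrame({"TARGET": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})
balance = target_balance(toy)
```

On the real data the call would be `target_balance(datasets['application_train'])`. For a 0/1 target, the share of class 1 equals the column mean shown by `describe()`.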
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
SK_ID_CURR 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
FLAG_OWN_REALTY 0
...
AMT_REQ_CREDIT_BUREAU_DAY 6049
AMT_REQ_CREDIT_BUREAU_WEEK 6049
AMT_REQ_CREDIT_BUREAU_MON 6049
AMT_REQ_CREDIT_BUREAU_QRT 6049
AMT_REQ_CREDIT_BUREAU_YEAR 6049
Length: 121, dtype: int64
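The raw `isna().sum()` output is truncated and hard to compare across columns. A minimal sketch of a more compact missing-value report follows, listing only the affected columns with counts and percentages. `missing_summary` is a hypothetical helper name, and the toy frame below uses an invented pattern of missing values.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(2),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data with an invented pattern of missing values.
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100006],
    "AMT_ANNUITY": [24700.5, 35698.5, np.nan, 29686.5],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, np.nan, np.nan, np.nan],
})
report = missing_summary(toy)
```

Applied to the loaded tables, this would be `missing_summary(datasets['application_test'])`.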
The application dataset has the most information about the client: Gender, income, family status, education ...
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
"previous_application","POS_CASH_balance")
for ds_name in ds_names:
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_CURR | int64 |
| 1 | SK_ID_BUREAU | int64 |
| 2 | CREDIT_ACTIVE | object |
| 3 | CREDIT_CURRENCY | object |
| 4 | DAYS_CREDIT | int64 |
| 5 | CREDIT_DAY_OVERDUE | int64 |
| 6 | DAYS_CREDIT_ENDDATE | float64 |
| 7 | DAYS_ENDDATE_FACT | float64 |
| 8 | AMT_CREDIT_MAX_OVERDUE | float64 |
| 9 | CNT_CREDIT_PROLONG | int64 |
| 10 | AMT_CREDIT_SUM | float64 |
| 11 | AMT_CREDIT_SUM_DEBT | float64 |
| 12 | AMT_CREDIT_SUM_LIMIT | float64 |
| 13 | AMT_CREDIT_SUM_OVERDUE | float64 |
| 14 | CREDIT_TYPE | object |
| 15 | DAYS_CREDIT_UPDATE | int64 |
| 16 | AMT_ANNUITY | float64 |
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
| | SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.610875e+06 | 1.082775e+06 | 5.919400e+05 | 1.716428e+06 | 1.716415e+06 | 1.458759e+06 | 1.124648e+06 | 1.716428e+06 | 1.716428e+06 | 4.896370e+05 |
| mean | 2.782149e+05 | 5.924434e+06 | -1.142108e+03 | 8.181666e-01 | 5.105174e+02 | -1.017437e+03 | 3.825418e+03 | 6.410406e-03 | 3.549946e+05 | 1.370851e+05 | 6.229515e+03 | 3.791276e+01 | -5.937483e+02 | 1.571276e+04 |
| std | 1.029386e+05 | 5.322657e+05 | 7.951649e+02 | 3.654443e+01 | 4.994220e+03 | 7.140106e+02 | 2.060316e+05 | 9.622391e-02 | 1.149811e+06 | 6.774011e+05 | 4.503203e+04 | 5.937650e+03 | 7.207473e+02 | 3.258269e+05 |
| min | 1.000010e+05 | 5.000000e+06 | -2.922000e+03 | 0.000000e+00 | -4.206000e+04 | -4.202300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.705600e+06 | -5.864061e+05 | 0.000000e+00 | -4.194700e+04 | 0.000000e+00 |
| 25% | 1.888668e+05 | 5.463954e+06 | -1.666000e+03 | 0.000000e+00 | -1.138000e+03 | -1.489000e+03 | 0.000000e+00 | 0.000000e+00 | 5.130000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.080000e+02 | 0.000000e+00 |
| 50% | 2.780550e+05 | 5.926304e+06 | -9.870000e+02 | 0.000000e+00 | -3.300000e+02 | -8.970000e+02 | 0.000000e+00 | 0.000000e+00 | 1.255185e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -3.950000e+02 | 0.000000e+00 |
| 75% | 3.674260e+05 | 6.385681e+06 | -4.740000e+02 | 0.000000e+00 | 4.740000e+02 | -4.250000e+02 | 0.000000e+00 | 0.000000e+00 | 3.150000e+05 | 4.015350e+04 | 0.000000e+00 | 0.000000e+00 | -3.300000e+01 | 1.350000e+04 |
| max | 4.562550e+05 | 6.843457e+06 | 0.000000e+00 | 2.792000e+03 | 3.119900e+04 | 0.000000e+00 | 1.159872e+08 | 9.000000e+00 | 5.850000e+08 | 1.701000e+08 | 4.705600e+06 | 3.756681e+06 | 3.720000e+02 | 1.184534e+08 |
SK_ID_CURR 0
SK_ID_BUREAU 0
CREDIT_ACTIVE 0
CREDIT_CURRENCY 0
DAYS_CREDIT 0
CREDIT_DAY_OVERDUE 0
DAYS_CREDIT_ENDDATE 105553
DAYS_ENDDATE_FACT 633653
AMT_CREDIT_MAX_OVERDUE 1124488
CNT_CREDIT_PROLONG 0
AMT_CREDIT_SUM 13
AMT_CREDIT_SUM_DEBT 257669
AMT_CREDIT_SUM_LIMIT 591780
AMT_CREDIT_SUM_OVERDUE 0
CREDIT_TYPE 0
DAYS_CREDIT_UPDATE 0
AMT_ANNUITY 1226791
dtype: int64
bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_BUREAU | int64 |
| 1 | MONTHS_BALANCE | int64 |
| 2 | STATUS | object |
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
| | SK_ID_BUREAU | MONTHS_BALANCE |
|---|---|---|
| count | 2.729992e+07 | 2.729992e+07 |
| mean | 6.036297e+06 | -3.074169e+01 |
| std | 4.923489e+05 | 2.386451e+01 |
| min | 5.001709e+06 | -9.600000e+01 |
| 25% | 5.730933e+06 | -4.600000e+01 |
| 50% | 6.070821e+06 | -2.500000e+01 |
| 75% | 6.431951e+06 | -1.100000e+01 |
| max | 6.842888e+06 | 0.000000e+00 |
SK_ID_BUREAU 0
MONTHS_BALANCE 0
STATUS 0
dtype: int64
credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_PREV | int64 |
| 1 | SK_ID_CURR | int64 |
| 2 | MONTHS_BALANCE | int64 |
| 3 | AMT_BALANCE | float64 |
| 4 | AMT_CREDIT_LIMIT_ACTUAL | int64 |
| 5 | AMT_DRAWINGS_ATM_CURRENT | float64 |
| 6 | AMT_DRAWINGS_CURRENT | float64 |
| 7 | AMT_DRAWINGS_OTHER_CURRENT | float64 |
| 8 | AMT_DRAWINGS_POS_CURRENT | float64 |
| 9 | AMT_INST_MIN_REGULARITY | float64 |
| 10 | AMT_PAYMENT_CURRENT | float64 |
| 11 | AMT_PAYMENT_TOTAL_CURRENT | float64 |
| 12 | AMT_RECEIVABLE_PRINCIPAL | float64 |
| 13 | AMT_RECIVABLE | float64 |
| 14 | AMT_TOTAL_RECEIVABLE | float64 |
| 15 | CNT_DRAWINGS_ATM_CURRENT | float64 |
| 16 | CNT_DRAWINGS_CURRENT | int64 |
| 17 | CNT_DRAWINGS_OTHER_CURRENT | float64 |
| 18 | CNT_DRAWINGS_POS_CURRENT | float64 |
| 19 | CNT_INSTALMENT_MATURE_CUM | float64 |
| 20 | NAME_CONTRACT_STATUS | object |
| 21 | SK_DPD | int64 |
| 22 | SK_DPD_DEF | int64 |
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | ... | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3.840312e+06 | 3.840312e+06 |
| mean | 1.904504e+06 | 2.783242e+05 | -3.452192e+01 | 5.830016e+04 | 1.538080e+05 | 5.961325e+03 | 7.433388e+03 | 2.881696e+02 | 2.968805e+03 | 3.540204e+03 | ... | 5.596588e+04 | 5.808881e+04 | 5.809829e+04 | 3.094490e-01 | 7.031439e-01 | 4.812496e-03 | 5.594791e-01 | 2.082508e+01 | 9.283667e+00 | 3.316220e-01 |
| std | 5.364695e+05 | 1.027045e+05 | 2.666775e+01 | 1.063070e+05 | 1.651457e+05 | 2.822569e+04 | 3.384608e+04 | 8.201989e+03 | 2.079689e+04 | 5.600154e+03 | ... | 1.025336e+05 | 1.059654e+05 | 1.059718e+05 | 1.100401e+00 | 3.190347e+00 | 8.263861e-02 | 3.240649e+00 | 2.005149e+01 | 9.751570e+01 | 2.147923e+01 |
| min | 1.000018e+06 | 1.000060e+05 | -9.600000e+01 | -4.202502e+05 | 0.000000e+00 | -6.827310e+03 | -6.211620e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | -4.233058e+05 | -4.202502e+05 | -4.202502e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434385e+06 | 1.895170e+05 | -5.500000e+01 | 0.000000e+00 | 4.500000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.897122e+06 | 2.783960e+05 | -2.800000e+01 | 0.000000e+00 | 1.125000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.369328e+06 | 3.675800e+05 | -1.100000e+01 | 8.904669e+04 | 1.800000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.633911e+03 | ... | 8.535924e+04 | 8.889949e+04 | 8.891451e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.200000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843496e+06 | 4.562500e+05 | -1.000000e+00 | 1.505902e+06 | 1.350000e+06 | 2.115000e+06 | 2.287098e+06 | 1.529847e+06 | 2.239274e+06 | 2.028820e+05 | ... | 1.472317e+06 | 1.493338e+06 | 1.493338e+06 | 5.100000e+01 | 1.650000e+02 | 1.200000e+01 | 1.650000e+02 | 1.200000e+02 | 3.260000e+03 | 3.260000e+03 |
8 rows × 22 columns
SK_ID_PREV 0
SK_ID_CURR 0
MONTHS_BALANCE 0
AMT_BALANCE 0
AMT_CREDIT_LIMIT_ACTUAL 0
AMT_DRAWINGS_ATM_CURRENT 749816
AMT_DRAWINGS_CURRENT 0
AMT_DRAWINGS_OTHER_CURRENT 749816
AMT_DRAWINGS_POS_CURRENT 749816
AMT_INST_MIN_REGULARITY 305236
AMT_PAYMENT_CURRENT 767988
AMT_PAYMENT_TOTAL_CURRENT 0
AMT_RECEIVABLE_PRINCIPAL 0
AMT_RECIVABLE 0
AMT_TOTAL_RECEIVABLE 0
CNT_DRAWINGS_ATM_CURRENT 749816
CNT_DRAWINGS_CURRENT 0
CNT_DRAWINGS_OTHER_CURRENT 749816
CNT_DRAWINGS_POS_CURRENT 749816
CNT_INSTALMENT_MATURE_CUM 305236
NAME_CONTRACT_STATUS 0
SK_DPD 0
SK_DPD_DEF 0
dtype: int64
installments_payments: shape is (13605401, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_PREV | int64 |
| 1 | SK_ID_CURR | int64 |
| 2 | NUM_INSTALMENT_VERSION | float64 |
| 3 | NUM_INSTALMENT_NUMBER | int64 |
| 4 | DAYS_INSTALMENT | float64 |
| 5 | DAYS_ENTRY_PAYMENT | float64 |
| 6 | AMT_INSTALMENT | float64 |
| 7 | AMT_PAYMENT | float64 |
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| count | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360250e+07 | 1.360540e+07 | 1.360250e+07 |
| mean | 1.903365e+06 | 2.784449e+05 | 8.566373e-01 | 1.887090e+01 | -1.042270e+03 | -1.051114e+03 | 1.705091e+04 | 1.723822e+04 |
| std | 5.362029e+05 | 1.027183e+05 | 1.035216e+00 | 2.666407e+01 | 8.009463e+02 | 8.005859e+02 | 5.057025e+04 | 5.473578e+04 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 1.000000e+00 | -2.922000e+03 | -4.921000e+03 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434191e+06 | 1.896390e+05 | 0.000000e+00 | 4.000000e+00 | -1.654000e+03 | -1.662000e+03 | 4.226085e+03 | 3.398265e+03 |
| 50% | 1.896520e+06 | 2.786850e+05 | 1.000000e+00 | 8.000000e+00 | -8.180000e+02 | -8.270000e+02 | 8.884080e+03 | 8.125515e+03 |
| 75% | 2.369094e+06 | 3.675300e+05 | 1.000000e+00 | 1.900000e+01 | -3.610000e+02 | -3.700000e+02 | 1.671021e+04 | 1.610842e+04 |
| max | 2.843499e+06 | 4.562550e+05 | 1.780000e+02 | 2.770000e+02 | -1.000000e+00 | -1.000000e+00 | 3.771488e+06 | 3.771488e+06 |
SK_ID_PREV 0
SK_ID_CURR 0
NUM_INSTALMENT_VERSION 0
NUM_INSTALMENT_NUMBER 0
DAYS_INSTALMENT 0
DAYS_ENTRY_PAYMENT 2905
AMT_INSTALMENT 0
AMT_PAYMENT 2905
dtype: int64
previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | SK_ID_PREV | 1670214 non-null | int64 |
| 1 | SK_ID_CURR | 1670214 non-null | int64 |
| 2 | NAME_CONTRACT_TYPE | 1670214 non-null | object |
| 3 | AMT_ANNUITY | 1297979 non-null | float64 |
| 4 | AMT_APPLICATION | 1670214 non-null | float64 |
| 5 | AMT_CREDIT | 1670213 non-null | float64 |
| 6 | AMT_DOWN_PAYMENT | 774370 non-null | float64 |
| 7 | AMT_GOODS_PRICE | 1284699 non-null | float64 |
| 8 | WEEKDAY_APPR_PROCESS_START | 1670214 non-null | object |
| 9 | HOUR_APPR_PROCESS_START | 1670214 non-null | int64 |
| 10 | FLAG_LAST_APPL_PER_CONTRACT | 1670214 non-null | object |
| 11 | NFLAG_LAST_APPL_IN_DAY | 1670214 non-null | int64 |
| 12 | RATE_DOWN_PAYMENT | 774370 non-null | float64 |
| 13 | RATE_INTEREST_PRIMARY | 5951 non-null | float64 |
| 14 | RATE_INTEREST_PRIVILEGED | 5951 non-null | float64 |
| 15 | NAME_CASH_LOAN_PURPOSE | 1670214 non-null | object |
| 16 | NAME_CONTRACT_STATUS | 1670214 non-null | object |
| 17 | DAYS_DECISION | 1670214 non-null | int64 |
| 18 | NAME_PAYMENT_TYPE | 1670214 non-null | object |
| 19 | CODE_REJECT_REASON | 1670214 non-null | object |
| 20 | NAME_TYPE_SUITE | 849809 non-null | object |
| 21 | NAME_CLIENT_TYPE | 1670214 non-null | object |
| 22 | NAME_GOODS_CATEGORY | 1670214 non-null | object |
| 23 | NAME_PORTFOLIO | 1670214 non-null | object |
| 24 | NAME_PRODUCT_TYPE | 1670214 non-null | object |
| 25 | CHANNEL_TYPE | 1670214 non-null | object |
| 26 | SELLERPLACE_AREA | 1670214 non-null | int64 |
| 27 | NAME_SELLER_INDUSTRY | 1670214 non-null | object |
| 28 | CNT_PAYMENT | 1297984 non-null | float64 |
| 29 | NAME_YIELD_GROUP | 1670214 non-null | object |
| 30 | PRODUCT_COMBINATION | 1669868 non-null | object |
| 31 | DAYS_FIRST_DRAWING | 997149 non-null | float64 |
| 32 | DAYS_FIRST_DUE | 997149 non-null | float64 |
| 33 | DAYS_LAST_DUE_1ST_VERSION | 997149 non-null | float64 |
| 34 | DAYS_LAST_DUE | 997149 non-null | float64 |
| 35 | DAYS_TERMINATION | 997149 non-null | float64 |
| 36 | NFLAG_INSURED_ON_APPROVAL | 997149 non-null | float64 |
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
| | SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | ... | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1.670214e+06 | 1.670214e+06 | 774370.000000 | ... | 5951.000000 | 1.670214e+06 | 1.670214e+06 | 1.297984e+06 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| mean | 1.923089e+06 | 2.783572e+05 | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | 1.248418e+01 | 9.964675e-01 | 0.079637 | ... | 0.773503 | -8.806797e+02 | 3.139511e+02 | 1.605408e+01 | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | 3.334028e+00 | 5.932963e-02 | 0.107823 | ... | 0.100879 | 7.790997e+02 | 7.127443e+03 | 1.456729e+01 | 88916.115834 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -0.000015 | ... | 0.373150 | -2.922000e+03 | -1.000000e+00 | 0.000000e+00 | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | 1.000000e+01 | 1.000000e+00 | 0.000000 | ... | 0.715645 | -1.300000e+03 | -1.000000e+00 | 6.000000e+00 | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | 1.200000e+01 | 1.000000e+00 | 0.051605 | ... | 0.835095 | -5.810000e+02 | 3.000000e+00 | 1.200000e+01 | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | 1.500000e+01 | 1.000000e+00 | 0.108909 | ... | 0.852537 | -2.800000e+02 | 8.200000e+01 | 2.400000e+01 | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | 2.300000e+01 | 1.000000e+00 | 1.000000 | ... | 1.000000 | -1.000000e+00 | 4.000000e+06 | 8.400000e+01 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
8 rows × 21 columns
SK_ID_PREV 0
SK_ID_CURR 0
NAME_CONTRACT_TYPE 0
AMT_ANNUITY 372235
AMT_APPLICATION 0
AMT_CREDIT 1
AMT_DOWN_PAYMENT 895844
AMT_GOODS_PRICE 385515
WEEKDAY_APPR_PROCESS_START 0
HOUR_APPR_PROCESS_START 0
FLAG_LAST_APPL_PER_CONTRACT 0
NFLAG_LAST_APPL_IN_DAY 0
RATE_DOWN_PAYMENT 895844
RATE_INTEREST_PRIMARY 1664263
RATE_INTEREST_PRIVILEGED 1664263
NAME_CASH_LOAN_PURPOSE 0
NAME_CONTRACT_STATUS 0
DAYS_DECISION 0
NAME_PAYMENT_TYPE 0
CODE_REJECT_REASON 0
NAME_TYPE_SUITE 820405
NAME_CLIENT_TYPE 0
NAME_GOODS_CATEGORY 0
NAME_PORTFOLIO 0
NAME_PRODUCT_TYPE 0
CHANNEL_TYPE 0
SELLERPLACE_AREA 0
NAME_SELLER_INDUSTRY 0
CNT_PAYMENT 372230
NAME_YIELD_GROUP 0
PRODUCT_COMBINATION 346
DAYS_FIRST_DRAWING 673065
DAYS_FIRST_DUE 673065
DAYS_LAST_DUE_1ST_VERSION 673065
DAYS_LAST_DUE 673065
DAYS_TERMINATION 673065
NFLAG_INSURED_ON_APPROVAL 673065
dtype: int64
POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_PREV | int64 |
| 1 | SK_ID_CURR | int64 |
| 2 | MONTHS_BALANCE | int64 |
| 3 | CNT_INSTALMENT | float64 |
| 4 | CNT_INSTALMENT_FUTURE | float64 |
| 5 | NAME_CONTRACT_STATUS | object |
| 6 | SK_DPD | int64 |
| 7 | SK_DPD_DEF | int64 |
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|
| count | 1.000136e+07 | 1.000136e+07 | 1.000136e+07 | 9.975287e+06 | 9.975271e+06 | 1.000136e+07 | 1.000136e+07 |
| mean | 1.903217e+06 | 2.784039e+05 | -3.501259e+01 | 1.708965e+01 | 1.048384e+01 | 1.160693e+01 | 6.544684e-01 |
| std | 5.358465e+05 | 1.027637e+05 | 2.606657e+01 | 1.199506e+01 | 1.110906e+01 | 1.327140e+02 | 3.276249e+01 |
| min | 1.000001e+06 | 1.000010e+05 | -9.600000e+01 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434405e+06 | 1.895500e+05 | -5.400000e+01 | 1.000000e+01 | 3.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.896565e+06 | 2.786540e+05 | -2.800000e+01 | 1.200000e+01 | 7.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.368963e+06 | 3.674290e+05 | -1.300000e+01 | 2.400000e+01 | 1.400000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843499e+06 | 4.562550e+05 | -1.000000e+00 | 9.200000e+01 | 8.500000e+01 | 4.231000e+03 | 3.595000e+03 |
SK_ID_PREV 0
SK_ID_CURR 0
MONTHS_BALANCE 0
CNT_INSTALMENT 26071
CNT_INSTALMENT_FUTURE 26087
NAME_CONTRACT_STATUS 0
SK_DPD 0
SK_DPD_DEF 0
dtype: int64
CPU times: user 58.5 s, sys: 5.84 s, total: 1min 4s
Wall time: 1min 4s
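The `isna().sum()` listings above report raw counts, which are hard to compare across tables whose row counts range from about 1.7 million to over 27 million. As a minimal sketch (the helper name `missing_summary` and the toy frame are hypothetical, not part of the original notebook), the counts can be turned into fractions and sorted so the sparsest columns come first:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and fraction of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(df),
    })
    # Keep only columns that actually have gaps, sorted by severity.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("frac_missing", ascending=False)


# Toy frame standing in for one of the loaded tables (values are made up).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_ANNUITY": [100.0, np.nan, np.nan, 50.0],
    "AMT_CREDIT": [10.0, 20.0, np.nan, 40.0],
})
result = missing_summary(toy)
print(result)
```

In the notebook this could be called as `missing_summary(datasets["previous_application"])`, assuming the `datasets` dictionary has already been populated.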
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
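The `memory usage` lines reported by `df.info()` run to several hundred MB per table, which can strain a Colab session. One possible mitigation, sketched here under the assumption that reduced float precision is acceptable, is to downcast numeric columns with `pd.to_numeric`. The helper name `downcast_numeric` and the toy frame are hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Picks the smallest signed integer dtype that holds every value.
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float32 is the smallest float dtype pandas will downcast to.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame standing in for a slice of application_train (values copied from
# the head() output; the frame itself is only for illustration).
toy = pd.DataFrame({
    "SK_ID_CURR": np.array([100002, 100003, 100004], dtype="int64"),
    "CNT_CHILDREN": np.array([0, 1, 2], dtype="int64"),
    "AMT_CREDIT": np.array([406597.5, 1293502.5, 135000.0], dtype="float64"),
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans"],
})
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Non-numeric columns are left untouched. Because float32 keeps only about seven significant digits, it is worth checking that this is acceptable for the amount columns before applying the helper to the full tables.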
The cells below are redundant; they are included so that all datasets can be reloaded quickly after an event such as a kernel failure.
import gc
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import set_config
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression, chi2, r_regression
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive,files
drive.mount('/content/gdrive')
# Google Colab dir: Account: kikarand@iu.edu
DATA_DIR = "gdrive/MyDrive/data/"
def load_data(in_path, name):
    """Read one CSV, print a quick overview, and return the DataFrame."""
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())  # info() prints its report and returns None, hence the trailing "None" in the output
    display(df.head(5))
    display(df.describe())
    display(df.isna().sum())  # missing-value count per column
    return df
datasets = {}  # store the datasets in a dictionary so we can keep track of them easily
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance", "installments_payments",
            "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
# Summarize the number of rows and columns in each loaded dataset
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
pa = datasets["previous_application"]
ip = datasets["installments_payments"]
pcb = datasets["POS_CASH_balance"]
ccb = datasets["credit_card_balance"]
bur = datasets["bureau"]
bb = datasets["bureau_balance"]
appsDF = datasets["previous_application"]
Mounted at /content/gdrive
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
SK_ID_CURR 0
TARGET 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
...
AMT_REQ_CREDIT_BUREAU_DAY 41519
AMT_REQ_CREDIT_BUREAU_WEEK 41519
AMT_REQ_CREDIT_BUREAU_MON 41519
AMT_REQ_CREDIT_BUREAU_QRT 41519
AMT_REQ_CREDIT_BUREAU_YEAR 41519
Length: 122, dtype: int64
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
SK_ID_CURR 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
FLAG_OWN_REALTY 0
...
AMT_REQ_CREDIT_BUREAU_DAY 6049
AMT_REQ_CREDIT_BUREAU_WEEK 6049
AMT_REQ_CREDIT_BUREAU_MON 6049
AMT_REQ_CREDIT_BUREAU_QRT 6049
AMT_REQ_CREDIT_BUREAU_YEAR 6049
Length: 121, dtype: int64
bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_CURR | int64 |
| 1 | SK_ID_BUREAU | int64 |
| 2 | CREDIT_ACTIVE | object |
| 3 | CREDIT_CURRENCY | object |
| 4 | DAYS_CREDIT | int64 |
| 5 | CREDIT_DAY_OVERDUE | int64 |
| 6 | DAYS_CREDIT_ENDDATE | float64 |
| 7 | DAYS_ENDDATE_FACT | float64 |
| 8 | AMT_CREDIT_MAX_OVERDUE | float64 |
| 9 | CNT_CREDIT_PROLONG | int64 |
| 10 | AMT_CREDIT_SUM | float64 |
| 11 | AMT_CREDIT_SUM_DEBT | float64 |
| 12 | AMT_CREDIT_SUM_LIMIT | float64 |
| 13 | AMT_CREDIT_SUM_OVERDUE | float64 |
| 14 | CREDIT_TYPE | object |
| 15 | DAYS_CREDIT_UPDATE | int64 |
| 16 | AMT_ANNUITY | float64 |
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
| | SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.610875e+06 | 1.082775e+06 | 5.919400e+05 | 1.716428e+06 | 1.716415e+06 | 1.458759e+06 | 1.124648e+06 | 1.716428e+06 | 1.716428e+06 | 4.896370e+05 |
| mean | 2.782149e+05 | 5.924434e+06 | -1.142108e+03 | 8.181666e-01 | 5.105174e+02 | -1.017437e+03 | 3.825418e+03 | 6.410406e-03 | 3.549946e+05 | 1.370851e+05 | 6.229515e+03 | 3.791276e+01 | -5.937483e+02 | 1.571276e+04 |
| std | 1.029386e+05 | 5.322657e+05 | 7.951649e+02 | 3.654443e+01 | 4.994220e+03 | 7.140106e+02 | 2.060316e+05 | 9.622391e-02 | 1.149811e+06 | 6.774011e+05 | 4.503203e+04 | 5.937650e+03 | 7.207473e+02 | 3.258269e+05 |
| min | 1.000010e+05 | 5.000000e+06 | -2.922000e+03 | 0.000000e+00 | -4.206000e+04 | -4.202300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.705600e+06 | -5.864061e+05 | 0.000000e+00 | -4.194700e+04 | 0.000000e+00 |
| 25% | 1.888668e+05 | 5.463954e+06 | -1.666000e+03 | 0.000000e+00 | -1.138000e+03 | -1.489000e+03 | 0.000000e+00 | 0.000000e+00 | 5.130000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.080000e+02 | 0.000000e+00 |
| 50% | 2.780550e+05 | 5.926304e+06 | -9.870000e+02 | 0.000000e+00 | -3.300000e+02 | -8.970000e+02 | 0.000000e+00 | 0.000000e+00 | 1.255185e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -3.950000e+02 | 0.000000e+00 |
| 75% | 3.674260e+05 | 6.385681e+06 | -4.740000e+02 | 0.000000e+00 | 4.740000e+02 | -4.250000e+02 | 0.000000e+00 | 0.000000e+00 | 3.150000e+05 | 4.015350e+04 | 0.000000e+00 | 0.000000e+00 | -3.300000e+01 | 1.350000e+04 |
| max | 4.562550e+05 | 6.843457e+06 | 0.000000e+00 | 2.792000e+03 | 3.119900e+04 | 0.000000e+00 | 1.159872e+08 | 9.000000e+00 | 5.850000e+08 | 1.701000e+08 | 4.705600e+06 | 3.756681e+06 | 3.720000e+02 | 1.184534e+08 |
SK_ID_CURR 0
SK_ID_BUREAU 0
CREDIT_ACTIVE 0
CREDIT_CURRENCY 0
DAYS_CREDIT 0
CREDIT_DAY_OVERDUE 0
DAYS_CREDIT_ENDDATE 105553
DAYS_ENDDATE_FACT 633653
AMT_CREDIT_MAX_OVERDUE 1124488
CNT_CREDIT_PROLONG 0
AMT_CREDIT_SUM 13
AMT_CREDIT_SUM_DEBT 257669
AMT_CREDIT_SUM_LIMIT 591780
AMT_CREDIT_SUM_OVERDUE 0
CREDIT_TYPE 0
DAYS_CREDIT_UPDATE 0
AMT_ANNUITY 1226791
dtype: int64
bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_BUREAU | int64 |
| 1 | MONTHS_BALANCE | int64 |
| 2 | STATUS | object |
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
| | SK_ID_BUREAU | MONTHS_BALANCE |
|---|---|---|
| count | 2.729992e+07 | 2.729992e+07 |
| mean | 6.036297e+06 | -3.074169e+01 |
| std | 4.923489e+05 | 2.386451e+01 |
| min | 5.001709e+06 | -9.600000e+01 |
| 25% | 5.730933e+06 | -4.600000e+01 |
| 50% | 6.070821e+06 | -2.500000e+01 |
| 75% | 6.431951e+06 | -1.100000e+01 |
| max | 6.842888e+06 | 0.000000e+00 |
SK_ID_BUREAU 0
MONTHS_BALANCE 0
STATUS 0
dtype: int64
credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_PREV | int64 |
| 1 | SK_ID_CURR | int64 |
| 2 | MONTHS_BALANCE | int64 |
| 3 | AMT_BALANCE | float64 |
| 4 | AMT_CREDIT_LIMIT_ACTUAL | int64 |
| 5 | AMT_DRAWINGS_ATM_CURRENT | float64 |
| 6 | AMT_DRAWINGS_CURRENT | float64 |
| 7 | AMT_DRAWINGS_OTHER_CURRENT | float64 |
| 8 | AMT_DRAWINGS_POS_CURRENT | float64 |
| 9 | AMT_INST_MIN_REGULARITY | float64 |
| 10 | AMT_PAYMENT_CURRENT | float64 |
| 11 | AMT_PAYMENT_TOTAL_CURRENT | float64 |
| 12 | AMT_RECEIVABLE_PRINCIPAL | float64 |
| 13 | AMT_RECIVABLE | float64 |
| 14 | AMT_TOTAL_RECEIVABLE | float64 |
| 15 | CNT_DRAWINGS_ATM_CURRENT | float64 |
| 16 | CNT_DRAWINGS_CURRENT | int64 |
| 17 | CNT_DRAWINGS_OTHER_CURRENT | float64 |
| 18 | CNT_DRAWINGS_POS_CURRENT | float64 |
| 19 | CNT_INSTALMENT_MATURE_CUM | float64 |
| 20 | NAME_CONTRACT_STATUS | object |
| 21 | SK_DPD | int64 |
| 22 | SK_DPD_DEF | int64 |
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | ... | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3.840312e+06 | 3.840312e+06 |
| mean | 1.904504e+06 | 2.783242e+05 | -3.452192e+01 | 5.830016e+04 | 1.538080e+05 | 5.961325e+03 | 7.433388e+03 | 2.881696e+02 | 2.968805e+03 | 3.540204e+03 | ... | 5.596588e+04 | 5.808881e+04 | 5.809829e+04 | 3.094490e-01 | 7.031439e-01 | 4.812496e-03 | 5.594791e-01 | 2.082508e+01 | 9.283667e+00 | 3.316220e-01 |
| std | 5.364695e+05 | 1.027045e+05 | 2.666775e+01 | 1.063070e+05 | 1.651457e+05 | 2.822569e+04 | 3.384608e+04 | 8.201989e+03 | 2.079689e+04 | 5.600154e+03 | ... | 1.025336e+05 | 1.059654e+05 | 1.059718e+05 | 1.100401e+00 | 3.190347e+00 | 8.263861e-02 | 3.240649e+00 | 2.005149e+01 | 9.751570e+01 | 2.147923e+01 |
| min | 1.000018e+06 | 1.000060e+05 | -9.600000e+01 | -4.202502e+05 | 0.000000e+00 | -6.827310e+03 | -6.211620e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | -4.233058e+05 | -4.202502e+05 | -4.202502e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434385e+06 | 1.895170e+05 | -5.500000e+01 | 0.000000e+00 | 4.500000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.897122e+06 | 2.783960e+05 | -2.800000e+01 | 0.000000e+00 | 1.125000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.369328e+06 | 3.675800e+05 | -1.100000e+01 | 8.904669e+04 | 1.800000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.633911e+03 | ... | 8.535924e+04 | 8.889949e+04 | 8.891451e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.200000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843496e+06 | 4.562500e+05 | -1.000000e+00 | 1.505902e+06 | 1.350000e+06 | 2.115000e+06 | 2.287098e+06 | 1.529847e+06 | 2.239274e+06 | 2.028820e+05 | ... | 1.472317e+06 | 1.493338e+06 | 1.493338e+06 | 5.100000e+01 | 1.650000e+02 | 1.200000e+01 | 1.650000e+02 | 1.200000e+02 | 3.260000e+03 | 3.260000e+03 |
8 rows × 22 columns
SK_ID_PREV 0
SK_ID_CURR 0
MONTHS_BALANCE 0
AMT_BALANCE 0
AMT_CREDIT_LIMIT_ACTUAL 0
AMT_DRAWINGS_ATM_CURRENT 749816
AMT_DRAWINGS_CURRENT 0
AMT_DRAWINGS_OTHER_CURRENT 749816
AMT_DRAWINGS_POS_CURRENT 749816
AMT_INST_MIN_REGULARITY 305236
AMT_PAYMENT_CURRENT 767988
AMT_PAYMENT_TOTAL_CURRENT 0
AMT_RECEIVABLE_PRINCIPAL 0
AMT_RECIVABLE 0
AMT_TOTAL_RECEIVABLE 0
CNT_DRAWINGS_ATM_CURRENT 749816
CNT_DRAWINGS_CURRENT 0
CNT_DRAWINGS_OTHER_CURRENT 749816
CNT_DRAWINGS_POS_CURRENT 749816
CNT_INSTALMENT_MATURE_CUM 305236
NAME_CONTRACT_STATUS 0
SK_DPD 0
SK_DPD_DEF 0
dtype: int64
installments_payments: shape is (13605401, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_PREV | int64 |
| 1 | SK_ID_CURR | int64 |
| 2 | NUM_INSTALMENT_VERSION | float64 |
| 3 | NUM_INSTALMENT_NUMBER | int64 |
| 4 | DAYS_INSTALMENT | float64 |
| 5 | DAYS_ENTRY_PAYMENT | float64 |
| 6 | AMT_INSTALMENT | float64 |
| 7 | AMT_PAYMENT | float64 |
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| count | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360250e+07 | 1.360540e+07 | 1.360250e+07 |
| mean | 1.903365e+06 | 2.784449e+05 | 8.566373e-01 | 1.887090e+01 | -1.042270e+03 | -1.051114e+03 | 1.705091e+04 | 1.723822e+04 |
| std | 5.362029e+05 | 1.027183e+05 | 1.035216e+00 | 2.666407e+01 | 8.009463e+02 | 8.005859e+02 | 5.057025e+04 | 5.473578e+04 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 1.000000e+00 | -2.922000e+03 | -4.921000e+03 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434191e+06 | 1.896390e+05 | 0.000000e+00 | 4.000000e+00 | -1.654000e+03 | -1.662000e+03 | 4.226085e+03 | 3.398265e+03 |
| 50% | 1.896520e+06 | 2.786850e+05 | 1.000000e+00 | 8.000000e+00 | -8.180000e+02 | -8.270000e+02 | 8.884080e+03 | 8.125515e+03 |
| 75% | 2.369094e+06 | 3.675300e+05 | 1.000000e+00 | 1.900000e+01 | -3.610000e+02 | -3.700000e+02 | 1.671021e+04 | 1.610842e+04 |
| max | 2.843499e+06 | 4.562550e+05 | 1.780000e+02 | 2.770000e+02 | -1.000000e+00 | -1.000000e+00 | 3.771488e+06 | 3.771488e+06 |
SK_ID_PREV 0
SK_ID_CURR 0
NUM_INSTALMENT_VERSION 0
NUM_INSTALMENT_NUMBER 0
DAYS_INSTALMENT 0
DAYS_ENTRY_PAYMENT 2905
AMT_INSTALMENT 0
AMT_PAYMENT 2905
dtype: int64
previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | SK_ID_PREV | 1670214 non-null | int64 |
| 1 | SK_ID_CURR | 1670214 non-null | int64 |
| 2 | NAME_CONTRACT_TYPE | 1670214 non-null | object |
| 3 | AMT_ANNUITY | 1297979 non-null | float64 |
| 4 | AMT_APPLICATION | 1670214 non-null | float64 |
| 5 | AMT_CREDIT | 1670213 non-null | float64 |
| 6 | AMT_DOWN_PAYMENT | 774370 non-null | float64 |
| 7 | AMT_GOODS_PRICE | 1284699 non-null | float64 |
| 8 | WEEKDAY_APPR_PROCESS_START | 1670214 non-null | object |
| 9 | HOUR_APPR_PROCESS_START | 1670214 non-null | int64 |
| 10 | FLAG_LAST_APPL_PER_CONTRACT | 1670214 non-null | object |
| 11 | NFLAG_LAST_APPL_IN_DAY | 1670214 non-null | int64 |
| 12 | RATE_DOWN_PAYMENT | 774370 non-null | float64 |
| 13 | RATE_INTEREST_PRIMARY | 5951 non-null | float64 |
| 14 | RATE_INTEREST_PRIVILEGED | 5951 non-null | float64 |
| 15 | NAME_CASH_LOAN_PURPOSE | 1670214 non-null | object |
| 16 | NAME_CONTRACT_STATUS | 1670214 non-null | object |
| 17 | DAYS_DECISION | 1670214 non-null | int64 |
| 18 | NAME_PAYMENT_TYPE | 1670214 non-null | object |
| 19 | CODE_REJECT_REASON | 1670214 non-null | object |
| 20 | NAME_TYPE_SUITE | 849809 non-null | object |
| 21 | NAME_CLIENT_TYPE | 1670214 non-null | object |
| 22 | NAME_GOODS_CATEGORY | 1670214 non-null | object |
| 23 | NAME_PORTFOLIO | 1670214 non-null | object |
| 24 | NAME_PRODUCT_TYPE | 1670214 non-null | object |
| 25 | CHANNEL_TYPE | 1670214 non-null | object |
| 26 | SELLERPLACE_AREA | 1670214 non-null | int64 |
| 27 | NAME_SELLER_INDUSTRY | 1670214 non-null | object |
| 28 | CNT_PAYMENT | 1297984 non-null | float64 |
| 29 | NAME_YIELD_GROUP | 1670214 non-null | object |
| 30 | PRODUCT_COMBINATION | 1669868 non-null | object |
| 31 | DAYS_FIRST_DRAWING | 997149 non-null | float64 |
| 32 | DAYS_FIRST_DUE | 997149 non-null | float64 |
| 33 | DAYS_LAST_DUE_1ST_VERSION | 997149 non-null | float64 |
| 34 | DAYS_LAST_DUE | 997149 non-null | float64 |
| 35 | DAYS_TERMINATION | 997149 non-null | float64 |
| 36 | NFLAG_INSURED_ON_APPROVAL | 997149 non-null | float64 |
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
| | SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | ... | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1.670214e+06 | 1.670214e+06 | 774370.000000 | ... | 5951.000000 | 1.670214e+06 | 1.670214e+06 | 1.297984e+06 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| mean | 1.923089e+06 | 2.783572e+05 | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | 1.248418e+01 | 9.964675e-01 | 0.079637 | ... | 0.773503 | -8.806797e+02 | 3.139511e+02 | 1.605408e+01 | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | 3.334028e+00 | 5.932963e-02 | 0.107823 | ... | 0.100879 | 7.790997e+02 | 7.127443e+03 | 1.456729e+01 | 88916.115834 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -0.000015 | ... | 0.373150 | -2.922000e+03 | -1.000000e+00 | 0.000000e+00 | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | 1.000000e+01 | 1.000000e+00 | 0.000000 | ... | 0.715645 | -1.300000e+03 | -1.000000e+00 | 6.000000e+00 | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | 1.200000e+01 | 1.000000e+00 | 0.051605 | ... | 0.835095 | -5.810000e+02 | 3.000000e+00 | 1.200000e+01 | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | 1.500000e+01 | 1.000000e+00 | 0.108909 | ... | 0.852537 | -2.800000e+02 | 8.200000e+01 | 2.400000e+01 | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | 2.300000e+01 | 1.000000e+00 | 1.000000 | ... | 1.000000 | -1.000000e+00 | 4.000000e+06 | 8.400000e+01 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
8 rows × 21 columns
SK_ID_PREV 0
SK_ID_CURR 0
NAME_CONTRACT_TYPE 0
AMT_ANNUITY 372235
AMT_APPLICATION 0
AMT_CREDIT 1
AMT_DOWN_PAYMENT 895844
AMT_GOODS_PRICE 385515
WEEKDAY_APPR_PROCESS_START 0
HOUR_APPR_PROCESS_START 0
FLAG_LAST_APPL_PER_CONTRACT 0
NFLAG_LAST_APPL_IN_DAY 0
RATE_DOWN_PAYMENT 895844
RATE_INTEREST_PRIMARY 1664263
RATE_INTEREST_PRIVILEGED 1664263
NAME_CASH_LOAN_PURPOSE 0
NAME_CONTRACT_STATUS 0
DAYS_DECISION 0
NAME_PAYMENT_TYPE 0
CODE_REJECT_REASON 0
NAME_TYPE_SUITE 820405
NAME_CLIENT_TYPE 0
NAME_GOODS_CATEGORY 0
NAME_PORTFOLIO 0
NAME_PRODUCT_TYPE 0
CHANNEL_TYPE 0
SELLERPLACE_AREA 0
NAME_SELLER_INDUSTRY 0
CNT_PAYMENT 372230
NAME_YIELD_GROUP 0
PRODUCT_COMBINATION 346
DAYS_FIRST_DRAWING 673065
DAYS_FIRST_DUE 673065
DAYS_LAST_DUE_1ST_VERSION 673065
DAYS_LAST_DUE 673065
DAYS_TERMINATION 673065
NFLAG_INSURED_ON_APPROVAL 673065
dtype: int64
POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | SK_ID_PREV | int64 |
| 1 | SK_ID_CURR | int64 |
| 2 | MONTHS_BALANCE | int64 |
| 3 | CNT_INSTALMENT | float64 |
| 4 | CNT_INSTALMENT_FUTURE | float64 |
| 5 | NAME_CONTRACT_STATUS | object |
| 6 | SK_DPD | int64 |
| 7 | SK_DPD_DEF | int64 |
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|
| count | 1.000136e+07 | 1.000136e+07 | 1.000136e+07 | 9.975287e+06 | 9.975271e+06 | 1.000136e+07 | 1.000136e+07 |
| mean | 1.903217e+06 | 2.784039e+05 | -3.501259e+01 | 1.708965e+01 | 1.048384e+01 | 1.160693e+01 | 6.544684e-01 |
| std | 5.358465e+05 | 1.027637e+05 | 2.606657e+01 | 1.199506e+01 | 1.110906e+01 | 1.327140e+02 | 3.276249e+01 |
| min | 1.000001e+06 | 1.000010e+05 | -9.600000e+01 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434405e+06 | 1.895500e+05 | -5.400000e+01 | 1.000000e+01 | 3.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.896565e+06 | 2.786540e+05 | -2.800000e+01 | 1.200000e+01 | 7.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.368963e+06 | 3.674290e+05 | -1.300000e+01 | 2.400000e+01 | 1.400000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843499e+06 | 4.562550e+05 | -1.000000e+00 | 9.200000e+01 | 8.500000e+01 | 4.231000e+03 | 3.595000e+03 |
SK_ID_PREV 0
SK_ID_CURR 0
MONTHS_BALANCE 0
CNT_INSTALMENT 26071
CNT_INSTALMENT_FUTURE 26087
NAME_CONTRACT_STATUS 0
SK_DPD 0
SK_DPD_DEF 0
dtype: int64
dataset application_train : [ 307,511, 122]
dataset application_test : [ 48,744, 121]
dataset bureau : [ 1,716,428, 17]
dataset bureau_balance : [ 27,299,925, 3]
dataset credit_card_balance : [ 3,840,312, 23]
dataset installments_payments : [ 13,605,401, 8]
dataset previous_application : [ 1,670,214, 37]
dataset POS_CASH_balance : [ 10,001,358, 8]
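A size listing like the one above can be produced with f-string width and thousands-separator formatting. The sketch below is only an illustration: the `shapes` dictionary is typed in from the listing (in the notebook the values would come from `datasets[name].shape`), and `format_size` is a hypothetical helper name, not something defined earlier in the notebook.

```python
# Shapes copied by hand from the listing above (assumption: in the notebook
# these would be read from datasets[name].shape instead).
shapes = {
    "application_train": (307511, 122),
    "application_test": (48744, 121),
    "bureau": (1716428, 17),
    "bureau_balance": (27299925, 3),
    "credit_card_balance": (3840312, 23),
    "installments_payments": (13605401, 8),
    "previous_application": (1670214, 37),
    "POS_CASH_balance": (10001358, 8),
}


def format_size(name, shape):
    """Format one summary line: padded name, comma-grouped rows, column count."""
    rows, cols = shape
    return f"dataset {name:24}: [ {rows:10,}, {cols}]"


lines = [format_size(name, shape) for name, shape in shapes.items()]
print("\n".join(lines))

# Total number of rows across all eight tables.
total_rows = sum(rows for rows, _ in shapes.values())
print(f"total rows: {total_rows:,}")
```

The largest table by far is `bureau_balance`, which is worth keeping in mind before joining or aggregating it in memory.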
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
datasets["application_train"].describe() #numerical only features
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
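The `TARGET` column in the summary above already shows how unbalanced the classes are. As a rough sketch, the class counts can be estimated from the reported `count` and `mean`. Because the mean is rounded in the output, this is an estimate rather than an exact tally, and the variable names are illustrative only. In the competition's data description, `TARGET = 1` marks clients with payment difficulties.

```python
# Values read from the describe() output above.
n_rows = 307511          # count of TARGET
target_mean = 0.080729   # mean of TARGET (rounded to 6 decimals in the output)

# For a 0/1 column the mean is the share of ones, so mean * count ~ number of ones.
n_pos = round(n_rows * target_mean)   # estimated rows with TARGET == 1
n_neg = n_rows - n_pos                # estimated rows with TARGET == 0
imbalance_ratio = n_neg / n_pos       # negatives per positive

print(f"TARGET=1: ~{n_pos:,} rows ({target_mean:.1%})")
print(f"TARGET=0: ~{n_neg:,} rows")
print(f"about {imbalance_ratio:.1f} negatives per positive")
```

With roughly 8% positives, plain accuracy is a weak yardstick here: a model that always predicts 0 would already be right about 92% of the time.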
datasets["application_test"].describe() #numerical only features
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
datasets["application_train"].describe(include='all') #look at all categorical and numerical
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
11 rows × 122 columns
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(40)
| | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
| BASEMENTAREA_MEDI | 58.52 | 179943 |
| BASEMENTAREA_AVG | 58.52 | 179943 |
| BASEMENTAREA_MODE | 58.52 | 179943 |
| EXT_SOURCE_1 | 56.38 | 173378 |
| NONLIVINGAREA_MODE | 55.18 | 169682 |
| NONLIVINGAREA_AVG | 55.18 | 169682 |
| NONLIVINGAREA_MEDI | 55.18 | 169682 |
| ELEVATORS_MEDI | 53.30 | 163891 |
| ELEVATORS_AVG | 53.30 | 163891 |
| ELEVATORS_MODE | 53.30 | 163891 |
| WALLSMATERIAL_MODE | 50.84 | 156341 |
| APARTMENTS_MEDI | 50.75 | 156061 |
| APARTMENTS_AVG | 50.75 | 156061 |
| APARTMENTS_MODE | 50.75 | 156061 |
| ENTRANCES_MEDI | 50.35 | 154828 |
| ENTRANCES_AVG | 50.35 | 154828 |
| ENTRANCES_MODE | 50.35 | 154828 |
| LIVINGAREA_AVG | 50.19 | 154350 |
| LIVINGAREA_MODE | 50.19 | 154350 |
| LIVINGAREA_MEDI | 50.19 | 154350 |
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
| | Percent | Test Missing Count |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
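The train and test cells above repeat the same computation, so it could be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: `missing_summary` and the toy frame are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame, count_label: str = "Missing Count") -> pd.DataFrame:
    """Return per-column missing percentage and count, sorted by count (descending)."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.concat([percent, counts], axis=1, keys=["Percent", count_label])
    return summary.sort_values(count_label, ascending=False)


# Toy data standing in for datasets["application_train"] (illustrative values only).
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})
toy_summary = missing_summary(toy, "Toy Missing Count")
print(toy_summary)
```

On the notebook's data the same helper would be called as `missing_summary(datasets["application_train"], "Train Missing Count").head(40)`.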
# datasets["application_train"]['TARGET'].astype(int).plot.hist();
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
ax.set_title("Credit Application vs Target : Count Plot")
sns.countplot(x=datasets["application_train"]['TARGET'], ax=ax)
application_train_corr = datasets["application_train"].corr()
sorted_target = application_train_corr["TARGET"].sort_values()
tail_10 = sorted_target.tail(10)
head_10 = sorted_target.head(10)
print('Most Positive Correlations:\n', tail_10)
print('\nMost Negative Correlations:\n', head_10)
Most Positive Correlations:
FLAG_DOCUMENT_3 0.044346
REG_CITY_NOT_LIVE_CITY 0.044395
FLAG_EMP_PHONE 0.045982
REG_CITY_NOT_WORK_CITY 0.050994
DAYS_ID_PUBLISH 0.051457
DAYS_LAST_PHONE_CHANGE 0.055218
REGION_RATING_CLIENT 0.058899
REGION_RATING_CLIENT_W_CITY 0.060893
DAYS_BIRTH 0.078239
TARGET 1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
EXT_SOURCE_3 -0.178919
EXT_SOURCE_2 -0.160472
EXT_SOURCE_1 -0.155317
DAYS_EMPLOYED -0.044932
FLOORSMAX_AVG -0.044003
FLOORSMAX_MEDI -0.043768
FLOORSMAX_MODE -0.043226
AMT_GOODS_PRICE -0.039645
REGION_POPULATION_RELATIVE -0.037227
ELEVATORS_AVG -0.034199
Name: TARGET, dtype: float64
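`TARGET` correlates perfectly with itself, so it occupies one slot in the positive list, and recent pandas releases may refuse to correlate non-numeric columns. A hedged sketch of a helper that keeps only numeric columns and drops the target from its own ranking is shown here; `target_correlations` and the toy frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str = "TARGET") -> pd.Series:
    """Pearson correlation of every numeric column with `target`, sorted ascending."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)  # remove the trivial self-correlation
    return corr.sort_values()


# Toy data (illustrative values only).
toy = pd.DataFrame({
    "TARGET": [0, 0, 1, 1],
    "up": [1.0, 2.0, 3.0, 4.0],
    "down": [4.0, 3.0, 2.0, 1.0],
    "label": ["a", "b", "c", "d"],  # non-numeric column, ignored by select_dtypes
})
toy_corr = target_correlations(toy)
print(toy_corr)
```

Taking `.head(10)` and `.tail(10)` of the result would then give the most negative and most positive features without `TARGET` itself.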
tail_10_corr = application_train_corr.loc[tail_10.index, tail_10.index]
head_10_corr_list = list(head_10.index) + ["TARGET"]
head_10_corr = application_train_corr.loc[head_10_corr_list, head_10_corr_list]
# set up the matplotlib figure
f, ax = plt.subplots(1,2, figsize=(25, 25), dpi=400)
# generate a mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(tail_10_corr, dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(tail_10_corr, mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax[0], annot=True);
ax[0].set_title("Most Positive Correlations")
# generate a mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(head_10_corr, dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(head_10_corr, mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax[1], annot=True);
ax[1].set_title("Most Negative Correlations")
Text(0.5, 1.0, 'Most Negative Correlations')
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
fig, ax = plt.subplots(1,1, figsize=(10,10),dpi=400)
plt.hist(datasets["application_train"]['DAYS_EMPLOYED'] /365, bins = 25)
plt.title('Years of Employment')
plt.xlabel('Years of Employment (years)')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
DAYS_EMPLOYED: some rows have the value 365243 (roughly 1000 years), i.e. some applicants would have been employed for about 1000 years, which is implausible.
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"],ax=ax);
ax.set_title('Applicants Occupation');
plt.xticks(rotation=90);
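Returning to the DAYS_EMPLOYED value of 365243 noted above: one common way to handle such a value is to treat it as missing while keeping a flag for the affected rows. The sketch below assumes that 365243 is a placeholder rather than a real measurement, which the notebook does not confirm; the column names `DAYS_EMPLOYED_ANOM` and `YEARS_EMPLOYED` are hypothetical.

```python
import numpy as np
import pandas as pd

PLACEHOLDER = 365243  # value observed in DAYS_EMPLOYED; treated as "missing" here (assumption)

# Toy data standing in for datasets["application_train"] (illustrative values only).
toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -30, 365243]})

# Keep a flag so a model can still see which rows carried the placeholder.
toy["DAYS_EMPLOYED_ANOM"] = toy["DAYS_EMPLOYED"] == PLACEHOLDER
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].replace(PLACEHOLDER, np.nan)

# Years employed as a positive number; placeholder rows stay NaN.
toy["YEARS_EMPLOYED"] = toy["DAYS_EMPLOYED"] / -365
print(toy)
```

With the placeholder removed, a histogram of `YEARS_EMPLOYED` is no longer dominated by a single bar far from the rest of the data.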
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='NAME_INCOME_TYPE', data=datasets["application_train"],ax=ax);
ax.set_title('Applicants Income Type');
plt.xticks(rotation=90);
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='NAME_HOUSING_TYPE', data=datasets["application_train"],ax=ax);
ax.set_title('Applicants Housing Type');
plt.xticks(rotation=90);
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='TARGET', data=datasets["application_train"],ax=ax, hue="CODE_GENDER");
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='CODE_GENDER', data=datasets["application_train"],ax=ax, hue="NAME_FAMILY_STATUS")
features = ["TARGET", "EXT_SOURCE_3", "EXT_SOURCE_2", "EXT_SOURCE_1", "DAYS_EMPLOYED"]
sns.pairplot(datasets["application_train"][features])
The relationship between TARGET and EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1 and DAYS_EMPLOYED is neither linear nor monotonic.
bureau
bur = datasets["bureau"]
bur.head(30)
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.00 | 0.000 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.00 | 171342.000 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.50 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.00 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.50 | 0 | 2700000.00 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
| 5 | 215354 | 5714467 | Active | currency 1 | -273 | 0 | 27460.0 | NaN | 0.00 | 0 | 180000.00 | 71017.380 | 108982.620 | 0.0 | Credit card | -31 | NaN |
| 6 | 215354 | 5714468 | Active | currency 1 | -43 | 0 | 79.0 | NaN | 0.00 | 0 | 42103.80 | 42103.800 | 0.000 | 0.0 | Consumer credit | -22 | NaN |
| 7 | 162297 | 5714469 | Closed | currency 1 | -1896 | 0 | -1684.0 | -1710.0 | 14985.00 | 0 | 76878.45 | 0.000 | 0.000 | 0.0 | Consumer credit | -1710 | NaN |
| 8 | 162297 | 5714470 | Closed | currency 1 | -1146 | 0 | -811.0 | -840.0 | 0.00 | 0 | 103007.70 | 0.000 | 0.000 | 0.0 | Consumer credit | -840 | NaN |
| 9 | 162297 | 5714471 | Active | currency 1 | -1146 | 0 | -484.0 | NaN | 0.00 | 0 | 4500.00 | 0.000 | 0.000 | 0.0 | Credit card | -690 | NaN |
| 10 | 162297 | 5714472 | Active | currency 1 | -1146 | 0 | -180.0 | NaN | 0.00 | 0 | 337500.00 | 0.000 | 0.000 | 0.0 | Credit card | -690 | NaN |
| 11 | 162297 | 5714473 | Closed | currency 1 | -2456 | 0 | -629.0 | -825.0 | NaN | 0 | 675000.00 | 0.000 | 0.000 | 0.0 | Consumer credit | -706 | NaN |
| 12 | 162297 | 5714474 | Active | currency 1 | -277 | 0 | 5261.0 | NaN | 0.00 | 0 | 7033500.00 | NaN | NaN | 0.0 | Mortgage | -31 | NaN |
| 13 | 402440 | 5714475 | Active | currency 1 | -96 | 0 | 269.0 | NaN | 0.00 | 0 | 89910.00 | 76905.000 | 0.000 | 0.0 | Consumer credit | -22 | NaN |
| 14 | 238881 | 5714482 | Closed | currency 1 | -318 | 0 | -187.0 | -187.0 | NaN | 0 | 0.00 | 0.000 | 0.000 | 0.0 | Credit card | -185 | NaN |
| 15 | 238881 | 5714484 | Closed | currency 1 | -2911 | 0 | -2607.0 | -2604.0 | NaN | 0 | 48555.00 | NaN | NaN | 0.0 | Consumer credit | -2601 | NaN |
| 16 | 238881 | 5714485 | Closed | currency 1 | -2148 | 0 | -1595.0 | -987.0 | NaN | 0 | 135000.00 | NaN | NaN | 0.0 | Consumer credit | -984 | NaN |
| 17 | 238881 | 5714486 | Active | currency 1 | -381 | 0 | NaN | NaN | NaN | 0 | 450000.00 | 520920.000 | NaN | 0.0 | Consumer credit | -4 | NaN |
| 18 | 238881 | 5714487 | Active | currency 1 | -95 | 0 | 1720.0 | NaN | NaN | 0 | 67500.00 | 8131.500 | NaN | 0.0 | Credit card | -7 | NaN |
| 19 | 238881 | 5714488 | Closed | currency 1 | -444 | 0 | -77.0 | -77.0 | 0.00 | 0 | 107184.06 | 0.000 | 0.000 | 0.0 | Consumer credit | -71 | NaN |
| 20 | 238881 | 5714489 | Active | currency 1 | -392 | 0 | NaN | NaN | 0.00 | 0 | 252000.00 | 23679.000 | 228320.100 | 0.0 | Credit card | -22 | NaN |
| 21 | 222183 | 5714491 | Active | currency 1 | -784 | 0 | 1008.0 | NaN | 0.00 | 0 | 0.00 | -411.615 | 411.615 | 0.0 | Credit card | -694 | NaN |
| 22 | 222183 | 5714492 | Active | currency 1 | -774 | 0 | 625.0 | NaN | NaN | 0 | 127840.50 | 0.000 | 0.000 | 0.0 | Credit card | -210 | NaN |
| 23 | 222183 | 5714493 | Active | currency 1 | -395 | 0 | 1431.0 | NaN | NaN | 0 | 1350000.00 | 1185493.500 | 0.000 | 0.0 | Consumer credit | -24 | NaN |
| 24 | 222183 | 5714495 | Closed | currency 1 | -2744 | 0 | -2561.0 | -2559.0 | 310.50 | 0 | 18157.50 | NaN | NaN | 0.0 | Consumer credit | -2559 | NaN |
| 25 | 222183 | 5714496 | Closed | currency 1 | -1103 | 0 | -7.0 | -343.0 | 20493.27 | 0 | 675000.00 | 0.000 | 0.000 | 0.0 | Consumer credit | -343 | NaN |
| 26 | 222183 | 5714497 | Active | currency 1 | -315 | 0 | 1512.0 | NaN | 88821.00 | 0 | 3709552.50 | NaN | NaN | 0.0 | Car loan | -32 | NaN |
| 27 | 426155 | 5714498 | Closed | currency 1 | -1331 | 0 | -994.0 | -1023.0 | 1350.00 | 0 | 39433.50 | 0.000 | 0.000 | 0.0 | Consumer credit | -1023 | NaN |
| 28 | 426155 | 5714499 | Closed | currency 1 | -2534 | 0 | -2352.0 | -2347.0 | NaN | 0 | 38830.50 | 0.000 | 0.000 | 0.0 | Consumer credit | -2345 | NaN |
| 29 | 426155 | 5714500 | Closed | currency 1 | -845 | 0 | -480.0 | -480.0 | 0.00 | 0 | 67500.00 | 0.000 | 0.000 | 0.0 | Consumer credit | -480 | NaN |
bur.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 # Column Dtype
--- ------ -----
 0 SK_ID_CURR int64
 1 SK_ID_BUREAU int64
 2 CREDIT_ACTIVE object
 3 CREDIT_CURRENCY object
 4 DAYS_CREDIT int64
 5 CREDIT_DAY_OVERDUE int64
 6 DAYS_CREDIT_ENDDATE float64
 7 DAYS_ENDDATE_FACT float64
 8 AMT_CREDIT_MAX_OVERDUE float64
 9 CNT_CREDIT_PROLONG int64
 10 AMT_CREDIT_SUM float64
 11 AMT_CREDIT_SUM_DEBT float64
 12 AMT_CREDIT_SUM_LIMIT float64
 13 AMT_CREDIT_SUM_OVERDUE float64
 14 CREDIT_TYPE object
 15 DAYS_CREDIT_UPDATE int64
 16 AMT_ANNUITY float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
bur.columns
Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
'AMT_CREDIT_SUM_OVERDUE', 'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
'AMT_ANNUITY'],
dtype='object')
bur.describe()
| | SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.610875e+06 | 1.082775e+06 | 5.919400e+05 | 1.716428e+06 | 1.716415e+06 | 1.458759e+06 | 1.124648e+06 | 1.716428e+06 | 1.716428e+06 | 4.896370e+05 |
| mean | 2.782149e+05 | 5.924434e+06 | -1.142108e+03 | 8.181666e-01 | 5.105174e+02 | -1.017437e+03 | 3.825418e+03 | 6.410406e-03 | 3.549946e+05 | 1.370851e+05 | 6.229515e+03 | 3.791276e+01 | -5.937483e+02 | 1.571276e+04 |
| std | 1.029386e+05 | 5.322657e+05 | 7.951649e+02 | 3.654443e+01 | 4.994220e+03 | 7.140106e+02 | 2.060316e+05 | 9.622391e-02 | 1.149811e+06 | 6.774011e+05 | 4.503203e+04 | 5.937650e+03 | 7.207473e+02 | 3.258269e+05 |
| min | 1.000010e+05 | 5.000000e+06 | -2.922000e+03 | 0.000000e+00 | -4.206000e+04 | -4.202300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.705600e+06 | -5.864061e+05 | 0.000000e+00 | -4.194700e+04 | 0.000000e+00 |
| 25% | 1.888668e+05 | 5.463954e+06 | -1.666000e+03 | 0.000000e+00 | -1.138000e+03 | -1.489000e+03 | 0.000000e+00 | 0.000000e+00 | 5.130000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.080000e+02 | 0.000000e+00 |
| 50% | 2.780550e+05 | 5.926304e+06 | -9.870000e+02 | 0.000000e+00 | -3.300000e+02 | -8.970000e+02 | 0.000000e+00 | 0.000000e+00 | 1.255185e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -3.950000e+02 | 0.000000e+00 |
| 75% | 3.674260e+05 | 6.385681e+06 | -4.740000e+02 | 0.000000e+00 | 4.740000e+02 | -4.250000e+02 | 0.000000e+00 | 0.000000e+00 | 3.150000e+05 | 4.015350e+04 | 0.000000e+00 | 0.000000e+00 | -3.300000e+01 | 1.350000e+04 |
| max | 4.562550e+05 | 6.843457e+06 | 0.000000e+00 | 2.792000e+03 | 3.119900e+04 | 0.000000e+00 | 1.159872e+08 | 9.000000e+00 | 5.850000e+08 | 1.701000e+08 | 4.705600e+06 | 3.756681e+06 | 3.720000e+02 | 1.184534e+08 |
bur.isna().sum()
SK_ID_CURR 0
SK_ID_BUREAU 0
CREDIT_ACTIVE 0
CREDIT_CURRENCY 0
DAYS_CREDIT 0
CREDIT_DAY_OVERDUE 0
DAYS_CREDIT_ENDDATE 105553
DAYS_ENDDATE_FACT 633653
AMT_CREDIT_MAX_OVERDUE 1124488
CNT_CREDIT_PROLONG 0
AMT_CREDIT_SUM 13
AMT_CREDIT_SUM_DEBT 257669
AMT_CREDIT_SUM_LIMIT 591780
AMT_CREDIT_SUM_OVERDUE 0
CREDIT_TYPE 0
DAYS_CREDIT_UPDATE 0
AMT_ANNUITY 1226791
dtype: int64
app_train = datasets['application_train']
app_train_req = app_train[["SK_ID_CURR", "TARGET"]]
merged_df = pd.merge(bur, app_train_req, how="left")
merged_df.corr()["TARGET"].sort_values()
AMT_CREDIT_SUM -0.010606
SK_ID_BUREAU -0.009018
AMT_CREDIT_SUM_LIMIT -0.005990
SK_ID_CURR -0.003024
AMT_ANNUITY 0.000117
CNT_CREDIT_PROLONG 0.001523
AMT_CREDIT_MAX_OVERDUE 0.001587
AMT_CREDIT_SUM_DEBT 0.002539
CREDIT_DAY_OVERDUE 0.002652
AMT_CREDIT_SUM_OVERDUE 0.006253
DAYS_CREDIT_ENDDATE 0.026497
DAYS_ENDDATE_FACT 0.039057
DAYS_CREDIT_UPDATE 0.041076
DAYS_CREDIT 0.061556
TARGET 1.000000
Name: TARGET, dtype: float64
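The merge above relies on pandas inferring the shared column, and bureau rows whose applicant is not in the training set end up with `TARGET = NaN`. A minimal sketch of a more explicit version follows, using the `on`, `validate` and `indicator` parameters of `pd.merge`; the toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for `bur` and `app_train_req` (illustrative values only).
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "AMT_CREDIT_SUM": [100.0, 200.0, 50.0, 75.0],
})
target_toy = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})

merged_toy = pd.merge(
    bureau_toy,
    target_toy,
    on="SK_ID_CURR",         # explicit join key
    how="left",              # keep every bureau row
    validate="many_to_one",  # raise if SK_ID_CURR repeats on the right side
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows without a training label (for example test applicants) have TARGET = NaN.
labelled = merged_toy[merged_toy["_merge"] == "both"].drop(columns="_merge")
print(labelled)
```

Restricting to `labelled` makes it explicit which rows actually contribute to a correlation with `TARGET`.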
# set up the matplotlib figure
f, ax = plt.subplots(1,1, figsize=(25, 25),dpi=400)
# generate a mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(merged_df.corr(), dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(merged_df.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True);
ax.set_title("Correlation matrix for Bureau and Target")
Text(0.5, 1.0, 'Correlation matrix for Bureau and Target')
app_train_req[app_train_req.TARGET==0]
| | SK_ID_CURR | TARGET |
|---|---|---|
| 1 | 100003 | 0 |
| 2 | 100004 | 0 |
| 3 | 100006 | 0 |
| 4 | 100007 | 0 |
| 5 | 100008 | 0 |
| ... | ... | ... |
| 307505 | 456249 | 0 |
| 307506 | 456251 | 0 |
| 307507 | 456252 | 0 |
| 307508 | 456253 | 0 |
| 307510 | 456255 | 0 |
282686 rows × 2 columns
len(app_train_req.SK_ID_CURR.unique())
307511
len(bur.SK_ID_CURR.unique())
305811
len(datasets["application_test"].SK_ID_CURR.unique())
48744
train_diff = np.setdiff1d(bur.SK_ID_CURR.unique(), app_train_req.SK_ID_CURR.unique())
test_diff = np.setdiff1d(train_diff, datasets["application_test"].SK_ID_CURR.unique())
len(test_diff)
0
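The check above only runs in one direction (bureau IDs that are in neither application table). A small sketch of checking both directions with plain Python sets is given here; the ID values are made up and the variable names are hypothetical.

```python
import pandas as pd

# Toy ID columns standing in for the three tables (illustrative values only).
train_ids = pd.Series([1, 2, 3, 4])
test_ids = pd.Series([5, 6])
bureau_ids = pd.Series([1, 1, 2, 5, 5])

applicant_set = set(train_ids.tolist()) | set(test_ids.tolist())
bureau_set = set(bureau_ids.tolist())

# Direction 1: bureau IDs that belong to no applicant.
orphans = bureau_set - applicant_set

# Direction 2: applicants with no bureau record at all.
no_history = applicant_set - bureau_set

print(len(orphans), sorted(no_history))
```

In the notebook the first direction yields 0; the second direction is what tells you how many applicants would get missing values after a left join on bureau features.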
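Every SK_ID_CURR in bureau appears in either application_train or application_test. The reverse does not hold: bureau has 305,811 unique SK_ID_CURR values, fewer than the 307,511 + 48,744 = 356,255 applicants, so some applicants have no bureau record.
observation_ids = [100002, 100031, 100003]
for grp, df in bur.groupby("SK_ID_CURR"):
    if grp in observation_ids:
        display(pd.merge(df, app_train_req, how="left"))
        observation_ids.remove(grp)
    if len(observation_ids) == 0:
        break
if len(observation_ids) != 0:
    print("NO BUREAU RECORD FOUND FOR :{}".format(observation_ids))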
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 6158904 | Closed | currency 1 | -1125 | 0 | -1038.0 | -1038.0 | NaN | 0 | 40761.000 | NaN | NaN | 0.0 | Credit card | -1038 | 0.0 | 1 |
| 1 | 100002 | 6158905 | Closed | currency 1 | -476 | 0 | NaN | -48.0 | NaN | 0 | 0.000 | 0.0 | NaN | 0.0 | Credit card | -47 | NaN | 1 |
| 2 | 100002 | 6158906 | Closed | currency 1 | -1437 | 0 | -1072.0 | -1185.0 | 0.000 | 0 | 135000.000 | 0.0 | 0.000 | 0.0 | Consumer credit | -1185 | 0.0 | 1 |
| 3 | 100002 | 6158907 | Closed | currency 1 | -1121 | 0 | -911.0 | -911.0 | 3321.000 | 0 | 19071.000 | NaN | NaN | 0.0 | Consumer credit | -906 | 0.0 | 1 |
| 4 | 100002 | 6158908 | Closed | currency 1 | -645 | 0 | 85.0 | -36.0 | 5043.645 | 0 | 120735.000 | 0.0 | 0.000 | 0.0 | Consumer credit | -34 | 0.0 | 1 |
| 5 | 100002 | 6158909 | Active | currency 1 | -103 | 0 | NaN | NaN | 40.500 | 0 | 31988.565 | 0.0 | 31988.565 | 0.0 | Credit card | -24 | 0.0 | 1 |
| 6 | 100002 | 6158903 | Active | currency 1 | -1042 | 0 | 780.0 | NaN | NaN | 0 | 450000.000 | 245781.0 | 0.000 | 0.0 | Consumer credit | -7 | 0.0 | 1 |
| 7 | 100002 | 6113835 | Closed | currency 1 | -1043 | 0 | 62.0 | -967.0 | 0.000 | 0 | 67500.000 | NaN | NaN | 0.0 | Credit card | -758 | 0.0 | 1 |
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100003 | 5885877 | Closed | currency 1 | -2586 | 0 | -2434.0 | -2131.0 | 0.0 | 0 | 22248.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -2131 | NaN | 0 |
| 1 | 100003 | 5885878 | Closed | currency 1 | -1636 | 0 | -540.0 | -540.0 | 0.0 | 0 | 112500.0 | 0.0 | 0.0 | 0.0 | Credit card | -540 | NaN | 0 |
| 2 | 100003 | 5885879 | Closed | currency 1 | -775 | 0 | -420.0 | -621.0 | 0.0 | 0 | 72652.5 | 0.0 | 0.0 | 0.0 | Consumer credit | -550 | NaN | 0 |
| 3 | 100003 | 5885880 | Active | currency 1 | -606 | 0 | 1216.0 | NaN | 0.0 | 0 | 810000.0 | 0.0 | 810000.0 | 0.0 | Credit card | -43 | NaN | 0 |
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100031 | 6187667 | Active | currency 1 | -180 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -9 | NaN | 1 |
| 1 | 100031 | 6187661 | Closed | currency 1 | -1365 | 0 | -269.0 | -991.0 | NaN | 0 | 573358.5 | 0.0 | NaN | 0.0 | Consumer credit | -989 | NaN | 1 |
| 2 | 100031 | 6187662 | Closed | currency 1 | -992 | 0 | 469.0 | -499.0 | NaN | 0 | 1125000.0 | 1125000.0 | NaN | 0.0 | Consumer credit | -499 | NaN | 1 |
| 3 | 100031 | 6187663 | Closed | currency 1 | -501 | 0 | 960.0 | -233.0 | NaN | 0 | 1350000.0 | NaN | NaN | 0.0 | Consumer credit | -211 | NaN | 1 |
| 4 | 100031 | 6187664 | Closed | currency 1 | -203 | 0 | 162.0 | -73.0 | NaN | 0 | 171013.5 | NaN | NaN | 0.0 | Consumer credit | -72 | NaN | 1 |
| 5 | 100031 | 6187665 | Active | currency 1 | -741 | 0 | NaN | NaN | NaN | 0 | 112500.0 | NaN | NaN | 0.0 | Credit card | -72 | NaN | 1 |
| 6 | 100031 | 6187666 | Active | currency 1 | -75 | 0 | 1021.0 | NaN | NaN | 0 | 548779.5 | NaN | NaN | 0.0 | Consumer credit | -12 | NaN | 1 |
fig, ax = plt.subplots(1,1, figsize=(10,10),dpi=400)
sns.countplot(x='CREDIT_TYPE', data=bur,ax=ax)
ax.set_title("Applicants Credit Type")
plt.xticks(rotation=90)
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), <a list of 15 Text major ticklabel objects>)
fig, ax = plt.subplots(1,1, figsize=(15,15), dpi=400)
sns.histplot(bur["DAYS_CREDIT"]/-365,ax=ax, kde=True, bins=24)
ax.set_title("Applicants Days Credit")
ax.set_xlabel("Number of years")
Text(0.5, 0, 'Number of years')
fig, ax = plt.subplots(1,1, figsize=(15,15), dpi=400)
sns.countplot(x="CREDIT_CURRENCY", data=bur,ax=ax, hue="CREDIT_ACTIVE")
fig, ax = plt.subplots(1,1, figsize=(15,15), dpi=400)
sns.countplot(x="CREDIT_ACTIVE", data=bur,ax=ax)
fig, ax = plt.subplots(1,1, figsize=(15,15), dpi=400)
sns.histplot(bur["CREDIT_DAY_OVERDUE"]/365,ax=ax)
ax.set_title("Applicants: Credit overdue")
ax.set_xlabel("Credit overdue in years")
Text(0.5, 0, 'Credit overdue in years')
CREDIT_CURRENCY has 4 types, but the data consists almost entirely of one of them.
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.histplot(bur["SK_ID_CURR"].value_counts().sort_values(), cumulative=True, ax=ax)
ax.set_title("Credit Bureau records of applicants")
ax.set_xlabel("Bureau records per applicant")
ax.set_ylabel("Cumulative number of applicants")
Text(0, 0.5, 'Cumulative number of applicants')
bur_sk_ids = bur["SK_ID_CURR"].value_counts().sort_values()
print(" >5 : --> {}\n >10 : --> {}\n >15 : --> {}\n >20 : --> {}".format(
len(bur_sk_ids[bur_sk_ids >5]),
len(bur_sk_ids[bur_sk_ids >10]),
len(bur_sk_ids[bur_sk_ids >15]),
len(bur_sk_ids[bur_sk_ids >20]),
))
 >5 : --> 123079
 >10 : --> 38328
 >15 : --> 10974
 >20 : --> 3079
agg_data = bur.groupby("SK_ID_CURR").agg(['mean','count','sum','min','max'])
agg_data.columns
MultiIndex([( 'SK_ID_BUREAU', 'mean'),
( 'SK_ID_BUREAU', 'count'),
( 'SK_ID_BUREAU', 'sum'),
( 'SK_ID_BUREAU', 'min'),
( 'SK_ID_BUREAU', 'max'),
( 'DAYS_CREDIT', 'mean'),
( 'DAYS_CREDIT', 'count'),
( 'DAYS_CREDIT', 'sum'),
( 'DAYS_CREDIT', 'min'),
( 'DAYS_CREDIT', 'max'),
( 'CREDIT_DAY_OVERDUE', 'mean'),
( 'CREDIT_DAY_OVERDUE', 'count'),
( 'CREDIT_DAY_OVERDUE', 'sum'),
( 'CREDIT_DAY_OVERDUE', 'min'),
( 'CREDIT_DAY_OVERDUE', 'max'),
( 'DAYS_CREDIT_ENDDATE', 'mean'),
( 'DAYS_CREDIT_ENDDATE', 'count'),
( 'DAYS_CREDIT_ENDDATE', 'sum'),
( 'DAYS_CREDIT_ENDDATE', 'min'),
( 'DAYS_CREDIT_ENDDATE', 'max'),
( 'DAYS_ENDDATE_FACT', 'mean'),
( 'DAYS_ENDDATE_FACT', 'count'),
( 'DAYS_ENDDATE_FACT', 'sum'),
( 'DAYS_ENDDATE_FACT', 'min'),
( 'DAYS_ENDDATE_FACT', 'max'),
('AMT_CREDIT_MAX_OVERDUE', 'mean'),
('AMT_CREDIT_MAX_OVERDUE', 'count'),
('AMT_CREDIT_MAX_OVERDUE', 'sum'),
('AMT_CREDIT_MAX_OVERDUE', 'min'),
('AMT_CREDIT_MAX_OVERDUE', 'max'),
( 'CNT_CREDIT_PROLONG', 'mean'),
( 'CNT_CREDIT_PROLONG', 'count'),
( 'CNT_CREDIT_PROLONG', 'sum'),
( 'CNT_CREDIT_PROLONG', 'min'),
( 'CNT_CREDIT_PROLONG', 'max'),
( 'AMT_CREDIT_SUM', 'mean'),
( 'AMT_CREDIT_SUM', 'count'),
( 'AMT_CREDIT_SUM', 'sum'),
( 'AMT_CREDIT_SUM', 'min'),
( 'AMT_CREDIT_SUM', 'max'),
( 'AMT_CREDIT_SUM_DEBT', 'mean'),
( 'AMT_CREDIT_SUM_DEBT', 'count'),
( 'AMT_CREDIT_SUM_DEBT', 'sum'),
( 'AMT_CREDIT_SUM_DEBT', 'min'),
( 'AMT_CREDIT_SUM_DEBT', 'max'),
( 'AMT_CREDIT_SUM_LIMIT', 'mean'),
( 'AMT_CREDIT_SUM_LIMIT', 'count'),
( 'AMT_CREDIT_SUM_LIMIT', 'sum'),
( 'AMT_CREDIT_SUM_LIMIT', 'min'),
( 'AMT_CREDIT_SUM_LIMIT', 'max'),
('AMT_CREDIT_SUM_OVERDUE', 'mean'),
('AMT_CREDIT_SUM_OVERDUE', 'count'),
('AMT_CREDIT_SUM_OVERDUE', 'sum'),
('AMT_CREDIT_SUM_OVERDUE', 'min'),
('AMT_CREDIT_SUM_OVERDUE', 'max'),
( 'DAYS_CREDIT_UPDATE', 'mean'),
( 'DAYS_CREDIT_UPDATE', 'count'),
( 'DAYS_CREDIT_UPDATE', 'sum'),
( 'DAYS_CREDIT_UPDATE', 'min'),
( 'DAYS_CREDIT_UPDATE', 'max'),
( 'AMT_ANNUITY', 'mean'),
( 'AMT_ANNUITY', 'count'),
( 'AMT_ANNUITY', 'sum'),
( 'AMT_ANNUITY', 'min'),
( 'AMT_ANNUITY', 'max')],
)
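The aggregation produces two-level column names, which are awkward to join onto the application table. One possible way to flatten them into single names is sketched below; the `BUREAU_` prefix and the toy frame are assumptions for illustration rather than the notebook's convention.

```python
import pandas as pd

# Toy stand-in for `bur` with numeric columns only (illustrative values).
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "DAYS_CREDIT": [-100, -300, -50],
    "AMT_CREDIT_SUM": [10.0, 30.0, 5.0],
})

agg_toy = bureau_toy.groupby("SK_ID_CURR").agg(["mean", "count", "sum", "min", "max"])

# Join the two column levels into one name, prefixed to mark the source table.
agg_toy.columns = ["BUREAU_" + "_".join(col).upper() for col in agg_toy.columns]

# Turn SK_ID_CURR back into an ordinary column so it can serve as a merge key.
agg_toy = agg_toy.reset_index()
print(agg_toy.columns.tolist())
```

The flattened frame has one row per `SK_ID_CURR` and could then be left-joined onto `application_train` or `application_test`.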
print("-------------DAYS_CREDIT--------------------")
display(agg_data.head(5)["DAYS_CREDIT"])
print("-------------CREDIT_DAY_OVERDUE--------------------")
display(agg_data.head(5)["CREDIT_DAY_OVERDUE"])
print("-------------AMT_CREDIT_MAX_OVERDUE--------------------")
display(agg_data.head(5)["AMT_CREDIT_MAX_OVERDUE"])
print("-------------AMT_CREDIT_SUM--------------------")
display(agg_data.head(5)["AMT_CREDIT_SUM"])
print("-------------AMT_CREDIT_SUM_LIMIT--------------------")
display(agg_data.head(5)["AMT_CREDIT_SUM_LIMIT"])
print("-------------AMT_CREDIT_SUM_OVERDUE--------------------")
display(agg_data.head(5)["AMT_CREDIT_SUM_OVERDUE"])
print("-------------AMT_ANNUITY--------------------")
display(agg_data.head(5)["AMT_ANNUITY"])
-------------DAYS_CREDIT--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | -735.000000 | 7 | -5145 | -1572 | -49 |
| 100002 | -874.000000 | 8 | -6992 | -1437 | -103 |
| 100003 | -1400.750000 | 4 | -5603 | -2586 | -606 |
| 100004 | -867.000000 | 2 | -1734 | -1326 | -408 |
| 100005 | -190.666667 | 3 | -572 | -373 | -62 |
-------------CREDIT_DAY_OVERDUE--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | 0.0 | 7 | 0 | 0 | 0 |
| 100002 | 0.0 | 8 | 0 | 0 | 0 |
| 100003 | 0.0 | 4 | 0 | 0 | 0 |
| 100004 | 0.0 | 2 | 0 | 0 | 0 |
| 100005 | 0.0 | 3 | 0 | 0 | 0 |
-------------AMT_CREDIT_MAX_OVERDUE--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | NaN | 0 | 0.000 | NaN | NaN |
| 100002 | 1681.029 | 5 | 8405.145 | 0.0 | 5043.645 |
| 100003 | 0.000 | 4 | 0.000 | 0.0 | 0.000 |
| 100004 | 0.000 | 1 | 0.000 | 0.0 | 0.000 |
| 100005 | 0.000 | 1 | 0.000 | 0.0 | 0.000 |
-------------AMT_CREDIT_SUM--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | 207623.571429 | 7 | 1453365.000 | 85500.0 | 378000.0 |
| 100002 | 108131.945625 | 8 | 865055.565 | 0.0 | 450000.0 |
| 100003 | 254350.125000 | 4 | 1017400.500 | 22248.0 | 810000.0 |
| 100004 | 94518.900000 | 2 | 189037.800 | 94500.0 | 94537.8 |
| 100005 | 219042.000000 | 3 | 657126.000 | 29826.0 | 568800.0 |
-------------AMT_CREDIT_SUM_LIMIT--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | 0.00000 | 6 | 0.000 | 0.0 | 0.000 |
| 100002 | 7997.14125 | 4 | 31988.565 | 0.0 | 31988.565 |
| 100003 | 202500.00000 | 4 | 810000.000 | 0.0 | 810000.000 |
| 100004 | 0.00000 | 2 | 0.000 | 0.0 | 0.000 |
| 100005 | 0.00000 | 3 | 0.000 | 0.0 | 0.000 |
-------------AMT_CREDIT_SUM_OVERDUE--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | 0.0 | 7 | 0.0 | 0.0 | 0.0 |
| 100002 | 0.0 | 8 | 0.0 | 0.0 | 0.0 |
| 100003 | 0.0 | 4 | 0.0 | 0.0 | 0.0 |
| 100004 | 0.0 | 2 | 0.0 | 0.0 | 0.0 |
| 100005 | 0.0 | 3 | 0.0 | 0.0 | 0.0 |
-------------AMT_ANNUITY--------------------
| SK_ID_CURR | mean | count | sum | min | max |
|---|---|---|---|---|---|
| 100001 | 3545.357143 | 7 | 24817.5 | 0.0 | 10822.5 |
| 100002 | 0.000000 | 7 | 0.0 | 0.0 | 0.0 |
| 100003 | NaN | 0 | 0.0 | NaN | NaN |
| 100004 | NaN | 0 | 0.0 | NaN | NaN |
| 100005 | 1420.500000 | 3 | 4261.5 | 0.0 | 4261.5 |
bb = datasets["bureau_balance"]
bb.head(30)
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
| 5 | 5715448 | -5 | C |
| 6 | 5715448 | -6 | C |
| 7 | 5715448 | -7 | C |
| 8 | 5715448 | -8 | C |
| 9 | 5715448 | -9 | 0 |
| 10 | 5715448 | -10 | 0 |
| 11 | 5715448 | -11 | X |
| 12 | 5715448 | -12 | X |
| 13 | 5715448 | -13 | X |
| 14 | 5715448 | -14 | 0 |
| 15 | 5715448 | -15 | 0 |
| 16 | 5715448 | -16 | 0 |
| 17 | 5715448 | -17 | 0 |
| 18 | 5715448 | -18 | 0 |
| 19 | 5715448 | -19 | 0 |
| 20 | 5715448 | -20 | X |
| 21 | 5715448 | -21 | X |
| 22 | 5715448 | -22 | X |
| 23 | 5715448 | -23 | X |
| 24 | 5715448 | -24 | X |
| 25 | 5715448 | -25 | X |
| 26 | 5715448 | -26 | X |
| 27 | 5715449 | 0 | C |
| 28 | 5715449 | -1 | C |
| 29 | 5715449 | -2 | C |
bb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 # Column Dtype
--- ------ -----
 0 SK_ID_BUREAU int64
 1 MONTHS_BALANCE int64
 2 STATUS object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
bb.describe()
| | SK_ID_BUREAU | MONTHS_BALANCE |
|---|---|---|
| count | 2.729992e+07 | 2.729992e+07 |
| mean | 6.036297e+06 | -3.074169e+01 |
| std | 4.923489e+05 | 2.386451e+01 |
| min | 5.001709e+06 | -9.600000e+01 |
| 25% | 5.730933e+06 | -4.600000e+01 |
| 50% | 6.070821e+06 | -2.500000e+01 |
| 75% | 6.431951e+06 | -1.100000e+01 |
| max | 6.842888e+06 | 0.000000e+00 |
bb.isna().sum()
SK_ID_BUREAU 0
MONTHS_BALANCE 0
STATUS 0
dtype: int64
bb.STATUS.unique()
array(['C', '0', 'X', '1', '2', '3', '5', '4'], dtype=object)
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='STATUS', data=bb,ax=ax)
ax.set_title("Applicants Repayment Status Count")
Text(0.5, 1.0, 'Applicants Repayment Status Count')
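The count plot shows the overall distribution of STATUS, but the table has many monthly rows per credit. One way to get a per-credit view is to count each code per `SK_ID_BUREAU`. The sketch below uses `pd.crosstab` on invented data; the `STATUS_` column prefix and `MONTHS_COUNT` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for `bb` (illustrative values only).
bb_toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 11, 11],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "0", "X", "1", "0"],
})

# One row per SK_ID_BUREAU, one column per STATUS code; cells hold month counts.
status_counts = pd.crosstab(bb_toy["SK_ID_BUREAU"], bb_toy["STATUS"])
status_counts.columns = ["STATUS_" + str(c) for c in status_counts.columns]

# Number of monthly records per bureau credit, aligned on the SK_ID_BUREAU index.
status_counts["MONTHS_COUNT"] = bb_toy.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"].count()
print(status_counts)
```

Such a frame could be joined to `bureau` on `SK_ID_BUREAU` and from there aggregated per `SK_ID_CURR`, which would allow the codes to be compared against `TARGET`.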
bb.groupby("SK_ID_BUREAU").count().sort_values(by="MONTHS_BALANCE")
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|
| 6052856 | 1 | 1 |
| 5061807 | 1 | 1 |
| 5918080 | 1 | 1 |
| 5061817 | 1 | 1 |
| 6688436 | 1 | 1 |
| ... | ... | ... |
| 5359551 | 97 | 97 |
| 6168534 | 97 | 97 |
| 6168536 | 97 | 97 |
| 6395449 | 97 | 97 |
| 5001709 | 97 | 97 |
817395 rows × 2 columns
What do the STATUS codes indicate, and do they have any significance?
pa = datasets["previous_application"]
pa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 SK_ID_PREV 1670214 non-null int64
 1 SK_ID_CURR 1670214 non-null int64
 2 NAME_CONTRACT_TYPE 1670214 non-null object
 3 AMT_ANNUITY 1297979 non-null float64
 4 AMT_APPLICATION 1670214 non-null float64
 5 AMT_CREDIT 1670213 non-null float64
 6 AMT_DOWN_PAYMENT 774370 non-null float64
 7 AMT_GOODS_PRICE 1284699 non-null float64
 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object
 9 HOUR_APPR_PROCESS_START 1670214 non-null int64
 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object
 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64
 12 RATE_DOWN_PAYMENT 774370 non-null float64
 13 RATE_INTEREST_PRIMARY 5951 non-null float64
 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64
 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object
 16 NAME_CONTRACT_STATUS 1670214 non-null object
 17 DAYS_DECISION 1670214 non-null int64
 18 NAME_PAYMENT_TYPE 1670214 non-null object
 19 CODE_REJECT_REASON 1670214 non-null object
 20 NAME_TYPE_SUITE 849809 non-null object
 21 NAME_CLIENT_TYPE 1670214 non-null object
 22 NAME_GOODS_CATEGORY 1670214 non-null object
 23 NAME_PORTFOLIO 1670214 non-null object
 24 NAME_PRODUCT_TYPE 1670214 non-null object
 25 CHANNEL_TYPE 1670214 non-null object
 26 SELLERPLACE_AREA 1670214 non-null int64
 27 NAME_SELLER_INDUSTRY 1670214 non-null object
 28 CNT_PAYMENT 1297984 non-null float64
 29 NAME_YIELD_GROUP 1670214 non-null object
 30 PRODUCT_COMBINATION 1669868 non-null object
 31 DAYS_FIRST_DRAWING 997149 non-null float64
 32 DAYS_FIRST_DUE 997149 non-null float64
 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64
 34 DAYS_LAST_DUE 997149 non-null float64
 35 DAYS_TERMINATION 997149 non-null float64
 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
pa.describe()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | ... | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1.670214e+06 | 1.670214e+06 | 774370.000000 | ... | 5951.000000 | 1.670214e+06 | 1.670214e+06 | 1.297984e+06 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| mean | 1.923089e+06 | 2.783572e+05 | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | 1.248418e+01 | 9.964675e-01 | 0.079637 | ... | 0.773503 | -8.806797e+02 | 3.139511e+02 | 1.605408e+01 | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | 3.334028e+00 | 5.932963e-02 | 0.107823 | ... | 0.100879 | 7.790997e+02 | 7.127443e+03 | 1.456729e+01 | 88916.115834 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -0.000015 | ... | 0.373150 | -2.922000e+03 | -1.000000e+00 | 0.000000e+00 | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | 1.000000e+01 | 1.000000e+00 | 0.000000 | ... | 0.715645 | -1.300000e+03 | -1.000000e+00 | 6.000000e+00 | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | 1.200000e+01 | 1.000000e+00 | 0.051605 | ... | 0.835095 | -5.810000e+02 | 3.000000e+00 | 1.200000e+01 | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | 1.500000e+01 | 1.000000e+00 | 0.108909 | ... | 0.852537 | -2.800000e+02 | 8.200000e+01 | 2.400000e+01 | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | 2.300000e+01 | 1.000000e+00 | 1.000000 | ... | 1.000000 | -1.000000e+00 | 4.000000e+06 | 8.400000e+01 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
8 rows × 21 columns
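In the summary above, 365243 is the maximum of every `DAYS_*` offset column and also the 25th percentile of `DAYS_FIRST_DRAWING`. That is roughly 1,000 years, so it looks like a placeholder rather than a real offset. The following is a minimal sketch, assuming that reading is correct, of replacing the value with NaN before computing statistics. The toy frame and the names `SENTINEL`, `toy` and `cleaned` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # assumed placeholder value, inferred from the describe() output

# Toy stand-in for a few previous_application columns
toy = pd.DataFrame({
    "DAYS_FIRST_DRAWING": [365243.0, -2922.0, np.nan],
    "DAYS_LAST_DUE": [-42.0, 365243.0, -182.0],
    "AMT_CREDIT": [17145.0, 679671.0, 136444.5],
})

# Only touch the DAYS_* columns; amounts are left as they are
days_cols = [c for c in toy.columns if c.startswith("DAYS_")]
cleaned = toy.copy()
cleaned[days_cols] = cleaned[days_cols].replace(SENTINEL, np.nan)

print(cleaned.isna().sum())
```

Applied to `pa`, the same replacement would keep the placeholder from dominating the mean and standard deviation of these columns.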
pa.head(10)
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 1383531 | 199383 | Cash loans | 23703.930 | 315000.0 | 340573.5 | NaN | 315000.0 | SATURDAY | 8 | ... | XNA | 18.0 | low_normal | Cash X-Sell: low | 365243.0 | -654.0 | -144.0 | -144.0 | -137.0 | 1.0 |
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 1656711 | 296299 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | MONDAY | 7 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 2367563 | 342292 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | MONDAY | 15 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 2579447 | 334349 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | SATURDAY | 15 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 37 columns
pa.isna().sum()
SK_ID_PREV                           0
SK_ID_CURR                           0
NAME_CONTRACT_TYPE                   0
AMT_ANNUITY                     372235
AMT_APPLICATION                      0
AMT_CREDIT                           1
AMT_DOWN_PAYMENT                895844
AMT_GOODS_PRICE                 385515
WEEKDAY_APPR_PROCESS_START           0
HOUR_APPR_PROCESS_START              0
FLAG_LAST_APPL_PER_CONTRACT          0
NFLAG_LAST_APPL_IN_DAY               0
RATE_DOWN_PAYMENT               895844
RATE_INTEREST_PRIMARY          1664263
RATE_INTEREST_PRIVILEGED       1664263
NAME_CASH_LOAN_PURPOSE               0
NAME_CONTRACT_STATUS                 0
DAYS_DECISION                        0
NAME_PAYMENT_TYPE                    0
CODE_REJECT_REASON                   0
NAME_TYPE_SUITE                 820405
NAME_CLIENT_TYPE                     0
NAME_GOODS_CATEGORY                  0
NAME_PORTFOLIO                       0
NAME_PRODUCT_TYPE                    0
CHANNEL_TYPE                         0
SELLERPLACE_AREA                     0
NAME_SELLER_INDUSTRY                 0
CNT_PAYMENT                     372230
NAME_YIELD_GROUP                     0
PRODUCT_COMBINATION                346
DAYS_FIRST_DRAWING              673065
DAYS_FIRST_DUE                  673065
DAYS_LAST_DUE_1ST_VERSION       673065
DAYS_LAST_DUE                   673065
DAYS_TERMINATION                673065
NFLAG_INSURED_ON_APPROVAL       673065
dtype: int64
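Raw missing counts are easier to compare as a share of rows. For example, `RATE_INTEREST_PRIMARY` is missing in 1664263 of 1670214 rows, about 99.6%. The following is a minimal sketch of ranking columns by percentage missing, shown on a toy frame. The frame `toy` and the name `pct_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in with different amounts of missing data per column
toy = pd.DataFrame({
    "AMT_ANNUITY": [1730.43, np.nan, 15060.735, np.nan],
    "RATE_INTEREST_PRIMARY": [np.nan, np.nan, np.nan, 0.18],
    "SK_ID_CURR": [271877, 108129, 122040, 176158],
})

# isna().mean() is the fraction of missing rows per column
missing = (
    toy.isna().mean().mul(100)
    .sort_values(ascending=False)
    .rename("pct_missing")
)

# Show only the columns that actually have gaps
print(missing[missing > 0])
```

Replacing `toy` with `pa` would give the same ranking for the full table.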
len(pa.SK_ID_CURR.unique())
338857
app_train = datasets['application_train']
app_train_req = app_train[["SK_ID_CURR", "TARGET"]]
merged_df = pd.merge(pa, app_train_req, how="left")
merged_df
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 | 0.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 | 0.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 | 0.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1670209 | 2300464 | 352015 | Consumer loans | 14704.290 | 267295.5 | 311400.0 | 0.0 | 267295.5 | WEDNESDAY | 12 | ... | 30.0 | low_normal | POS industry with interest | 365243.0 | -508.0 | 362.0 | -358.0 | -351.0 | 0.0 | NaN |
| 1670210 | 2357031 | 334635 | Consumer loans | 6622.020 | 87750.0 | 64291.5 | 29250.0 | 87750.0 | TUESDAY | 15 | ... | 12.0 | middle | POS industry with interest | 365243.0 | -1604.0 | -1274.0 | -1304.0 | -1297.0 | 0.0 | NaN |
| 1670211 | 2659632 | 249544 | Consumer loans | 11520.855 | 105237.0 | 102523.5 | 10525.5 | 105237.0 | MONDAY | 12 | ... | 10.0 | low_normal | POS household with interest | 365243.0 | -1457.0 | -1187.0 | -1187.0 | -1181.0 | 0.0 | 0.0 |
| 1670212 | 2785582 | 400317 | Cash loans | 18821.520 | 180000.0 | 191880.0 | NaN | 180000.0 | WEDNESDAY | 9 | ... | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -1155.0 | -825.0 | -825.0 | -817.0 | 1.0 | 0.0 |
| 1670213 | 2418762 | 261212 | Cash loans | 16431.300 | 360000.0 | 360000.0 | NaN | 360000.0 | SUNDAY | 10 | ... | 48.0 | middle | Cash X-Sell: middle | 365243.0 | -1163.0 | 247.0 | -443.0 | -423.0 | 0.0 | 0.0 |
1670214 rows × 38 columns
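Because this is a left merge, previous applications whose `SK_ID_CURR` is not in `application_train` keep their row but get `TARGET = NaN` (rows 1670209 and 1670210 above are examples). The following is a minimal sketch, on toy frames, of making the join key explicit and counting the unlabelled rows. The frames `prev` and `train` and their values are illustrative.

```python
import pandas as pd

# Toy stand-ins: four previous applications, two labelled applicants
prev = pd.DataFrame({
    "SK_ID_PREV": [1, 2, 3, 4],
    "SK_ID_CURR": [100, 100, 200, 300],
})
train = pd.DataFrame({"SK_ID_CURR": [100, 200], "TARGET": [0, 1]})

merged = pd.merge(
    prev, train,
    how="left",
    on="SK_ID_CURR",          # explicit join key
    validate="many_to_one",   # raises if an applicant appears twice in train
    indicator=True,           # adds a `_merge` column: 'both' or 'left_only'
)

unmatched = (merged["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(merged)} rows have no TARGET")

# Keep only the labelled rows for target-related analysis
labelled = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

The `validate` argument is a cheap guard against accidental row duplication if the right-hand table ever contains repeated keys.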
merged_df.corr()["TARGET"].sort_values()
DAYS_FIRST_DRAWING          -0.031154
HOUR_APPR_PROCESS_START     -0.027809
RATE_DOWN_PAYMENT           -0.026111
AMT_DOWN_PAYMENT            -0.016918
AMT_ANNUITY                 -0.014922
DAYS_FIRST_DUE              -0.006651
AMT_APPLICATION             -0.005583
NFLAG_LAST_APPL_IN_DAY      -0.002887
SELLERPLACE_AREA            -0.002539
AMT_CREDIT                  -0.002350
RATE_INTEREST_PRIMARY       -0.001470
SK_ID_CURR                  -0.001246
AMT_GOODS_PRICE              0.000254
NFLAG_INSURED_ON_APPROVAL    0.000653
SK_ID_PREV                   0.002009
DAYS_TERMINATION             0.016981
DAYS_LAST_DUE                0.017522
DAYS_LAST_DUE_1ST_VERSION    0.018021
RATE_INTEREST_PRIVILEGED     0.028640
CNT_PAYMENT                  0.030480
DAYS_DECISION                0.039901
TARGET                       1.000000
Name: TARGET, dtype: float64
# set up the matplotlib figure
f, ax = plt.subplots(1, 1, figsize=(25, 25), dpi=400)
# compute the correlation matrix once and reuse it
corr = merged_df.corr()
# generate a mask for the upper triangle (including the diagonal)
mask = np.zeros_like(corr, dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True)
ax.set_title("Correlation matrix for Previous Applications and Target")
Text(0.5, 1.0, 'Correlation matrix for Previous Applications and Target')
pa.groupby("SK_ID_CURR").count()
| SK_ID_PREV | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||
| 100001 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 100002 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 100003 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | ... | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 100004 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 100005 | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | ... | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 456251 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 456252 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 456253 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 456254 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 456255 | 8 | 8 | 8 | 8 | 8 | 3 | 8 | 8 | 8 | 8 | ... | 8 | 8 | 8 | 8 | 6 | 6 | 6 | 6 | 6 | 6 |
338857 rows × 36 columns
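The grouped count above has one row per applicant, which is the shape needed to join previous-application information onto `application_train`. The following is a minimal sketch of building a few per-applicant summaries with named aggregation. The toy rows, the feature names (`prev_app_count`, `prev_credit_mean`, `prev_refused_share`) and the choice of statistics are illustrative assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for previous_application (values are illustrative)
prev = pd.DataFrame({
    "SK_ID_CURR": [100003, 100003, 100003, 100005, 100005],
    "SK_ID_PREV": [1810518, 2636178, 2396755, 11, 12],
    "AMT_CREDIT": [1035882.0, 348637.5, 68053.5, 20000.0, np.nan],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Approved",
                             "Refused", "Approved"],
})

# One output column per (source column, aggregation) pair
features = prev.groupby("SK_ID_CURR").agg(
    prev_app_count=("SK_ID_PREV", "count"),
    prev_credit_mean=("AMT_CREDIT", "mean"),   # NaN values are skipped
    prev_refused_share=("NAME_CONTRACT_STATUS",
                        lambda s: (s == "Refused").mean()),
)

print(features)
```

Named aggregation yields flat, descriptive column names, so the result can be merged on `SK_ID_CURR` without further renaming.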
previous_application when grouped by "SK_ID_CURR"
app_train_req[app_train_req.TARGET==1].head(20)
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100002 | 1 |
| 26 | 100031 | 1 |
| 40 | 100047 | 1 |
| 42 | 100049 | 1 |
| 81 | 100096 | 1 |
| 94 | 100112 | 1 |
| 110 | 100130 | 1 |
| 138 | 100160 | 1 |
| 154 | 100181 | 1 |
| 163 | 100192 | 1 |
| 180 | 100209 | 1 |
| 184 | 100214 | 1 |
| 211 | 100246 | 1 |
| 235 | 100273 | 1 |
| 242 | 100282 | 1 |
| 246 | 100286 | 1 |
| 255 | 100295 | 1 |
| 260 | 100300 | 1 |
| 261 | 100301 | 1 |
| 283 | 100326 | 1 |
observation_ids = [100003, 100192, 100286]
# show the previous applications (with TARGET) for each selected applicant
for grp, df in pa.groupby("SK_ID_CURR"):
    if grp in observation_ids:
        display(pd.merge(df, app_train_req, how="left"))
        observation_ids.remove(grp)
    if len(observation_ids) == 0:
        break
# any ids still in the list have no rows in previous_application
if len(observation_ids) != 0:
    print("NO PREVIOUS APPLICATION FOUND FOR :{}".format(observation_ids))
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1810518 | 100003 | Cash loans | 98356.995 | 900000.0 | 1035882.0 | NaN | 900000.0 | FRIDAY | 12 | ... | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -716.0 | -386.0 | -536.0 | -527.0 | 1.0 | 0 |
| 1 | 2636178 | 100003 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 0.0 | 337500.0 | SUNDAY | 17 | ... | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 | 0 |
| 2 | 2396755 | 100003 | Consumer loans | 6737.310 | 68809.5 | 68053.5 | 6885.0 | 68809.5 | SATURDAY | 15 | ... | 12.0 | middle | POS household with interest | 365243.0 | -2310.0 | -1980.0 | -1980.0 | -1976.0 | 1.0 | 0 |
3 rows × 38 columns
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1185166 | 100286 | Consumer loans | 4889.97 | 47385.0 | 52389.0 | 0.0 | 47385.0 | SATURDAY | 8 | ... | 12.0 | low_normal | POS household with interest | 365243.0 | -207.0 | 123.0 | 365243.0 | 365243.0 | 0.0 | 1 |
1 rows × 38 columns
NO PREVIOUS APPLICATION FOUND FOR :[100192]
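Looping over every group to find three applicants scans all 338857 groups whenever one of the ids is absent. The following is a minimal sketch of the same lookup using a boolean filter with `isin`, shown on toy frames. The frames `prev` and `train` are illustrative stand-ins for `pa` and `app_train_req`.

```python
import pandas as pd

# Toy stand-ins for pa and app_train_req
prev = pd.DataFrame({
    "SK_ID_PREV": [1810518, 2636178, 2396755, 1185166],
    "SK_ID_CURR": [100003, 100003, 100003, 100286],
})
train = pd.DataFrame({
    "SK_ID_CURR": [100003, 100192, 100286],
    "TARGET": [0, 1, 1],
})

observation_ids = [100003, 100192, 100286]

# Filter first, then attach TARGET to the small subset only
subset = prev[prev["SK_ID_CURR"].isin(observation_ids)].merge(
    train, how="left", on="SK_ID_CURR"
)

# Ids that were requested but never appear in previous applications
missing_ids = sorted(set(observation_ids) - set(subset["SK_ID_CURR"]))

for sk_id, group in subset.groupby("SK_ID_CURR"):
    print(sk_id, len(group), "previous application(s)")
if missing_ids:
    print("NO PREVIOUS APPLICATION FOUND FOR :{}".format(missing_ids))
```

This version also leaves `observation_ids` unmodified, so the cell can be re-run without redefining the list.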
previous_application exists for SK_ID_CURR
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='NAME_CONTRACT_TYPE', data=pa,ax=ax)
ax.set_title("Previous applications Contract Type")
Text(0.5, 1.0, 'Previous applications Contract Type')
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='NAME_YIELD_GROUP', data=pa,ax=ax)
ax.set_title("Previous applications name yield groups")
Text(0.5, 1.0, 'Previous applications name yield groups')
pa_sk_ids = pa['SK_ID_CURR'].value_counts()
len(pa_sk_ids[pa_sk_ids==1])
60458
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
# x axis: number of previous applications per applicant; y axis: number of applicants
sns.histplot(pa['SK_ID_CURR'].value_counts(), element="step", ax=ax)
ax.set_ylabel("Number of applicants")
ax.set_xlabel("Previous applications per applicant")
Text(0.5, 0, 'Previous applications per applicant')
credit_card_balance
ccb = datasets["credit_card_balance"]
ccb.head(10)
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.500 | 0.0 | 877.500 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.000 | 0.0 | 0.000 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.000 | 0.0 | 0.000 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.000 | 0.0 | 0.000 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.000 | 0.0 | 11547.000 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
| 5 | 2646502 | 380010 | -7 | 82903.815 | 270000 | 0.0 | 0.000 | 0.0 | 0.000 | 4449.105 | ... | 82773.315 | 82773.315 | 0.0 | 0 | 0.0 | 0.0 | 2.0 | Active | 7 | 0 |
| 6 | 1079071 | 171320 | -6 | 353451.645 | 585000 | 67500.0 | 67500.000 | 0.0 | 0.000 | 14684.175 | ... | 351881.145 | 351881.145 | 1.0 | 1 | 0.0 | 0.0 | 6.0 | Active | 0 | 0 |
| 7 | 2095912 | 118650 | -7 | 47962.125 | 45000 | 45000.0 | 45000.000 | 0.0 | 0.000 | 0.000 | ... | 47962.125 | 47962.125 | 1.0 | 1 | 0.0 | 0.0 | 51.0 | Active | 0 | 0 |
| 8 | 2181852 | 367360 | -4 | 291543.075 | 292500 | 90000.0 | 289339.425 | 0.0 | 199339.425 | 130.500 | ... | 286831.575 | 286831.575 | 3.0 | 8 | 0.0 | 5.0 | 3.0 | Active | 0 | 0 |
| 9 | 1235299 | 203885 | -5 | 201261.195 | 225000 | 76500.0 | 111026.700 | 0.0 | 34526.700 | 6338.340 | ... | 197224.695 | 197224.695 | 3.0 | 9 | 0.0 | 6.0 | 38.0 | Active | 0 | 0 |
10 rows × 23 columns
ccb.describe()
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | ... | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3.840312e+06 | 3.840312e+06 |
| mean | 1.904504e+06 | 2.783242e+05 | -3.452192e+01 | 5.830016e+04 | 1.538080e+05 | 5.961325e+03 | 7.433388e+03 | 2.881696e+02 | 2.968805e+03 | 3.540204e+03 | ... | 5.596588e+04 | 5.808881e+04 | 5.809829e+04 | 3.094490e-01 | 7.031439e-01 | 4.812496e-03 | 5.594791e-01 | 2.082508e+01 | 9.283667e+00 | 3.316220e-01 |
| std | 5.364695e+05 | 1.027045e+05 | 2.666775e+01 | 1.063070e+05 | 1.651457e+05 | 2.822569e+04 | 3.384608e+04 | 8.201989e+03 | 2.079689e+04 | 5.600154e+03 | ... | 1.025336e+05 | 1.059654e+05 | 1.059718e+05 | 1.100401e+00 | 3.190347e+00 | 8.263861e-02 | 3.240649e+00 | 2.005149e+01 | 9.751570e+01 | 2.147923e+01 |
| min | 1.000018e+06 | 1.000060e+05 | -9.600000e+01 | -4.202502e+05 | 0.000000e+00 | -6.827310e+03 | -6.211620e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | -4.233058e+05 | -4.202502e+05 | -4.202502e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434385e+06 | 1.895170e+05 | -5.500000e+01 | 0.000000e+00 | 4.500000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.897122e+06 | 2.783960e+05 | -2.800000e+01 | 0.000000e+00 | 1.125000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.369328e+06 | 3.675800e+05 | -1.100000e+01 | 8.904669e+04 | 1.800000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.633911e+03 | ... | 8.535924e+04 | 8.889949e+04 | 8.891451e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.200000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843496e+06 | 4.562500e+05 | -1.000000e+00 | 1.505902e+06 | 1.350000e+06 | 2.115000e+06 | 2.287098e+06 | 1.529847e+06 | 2.239274e+06 | 2.028820e+05 | ... | 1.472317e+06 | 1.493338e+06 | 1.493338e+06 | 5.100000e+01 | 1.650000e+02 | 1.200000e+01 | 1.650000e+02 | 1.200000e+02 | 3.260000e+03 | 3.260000e+03 |
8 rows × 22 columns
ccb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                       Dtype
---  ------                       -----
 0   SK_ID_PREV                   int64
 1   SK_ID_CURR                   int64
 2   MONTHS_BALANCE               int64
 3   AMT_BALANCE                  float64
 4   AMT_CREDIT_LIMIT_ACTUAL      int64
 5   AMT_DRAWINGS_ATM_CURRENT     float64
 6   AMT_DRAWINGS_CURRENT         float64
 7   AMT_DRAWINGS_OTHER_CURRENT   float64
 8   AMT_DRAWINGS_POS_CURRENT     float64
 9   AMT_INST_MIN_REGULARITY      float64
 10  AMT_PAYMENT_CURRENT          float64
 11  AMT_PAYMENT_TOTAL_CURRENT    float64
 12  AMT_RECEIVABLE_PRINCIPAL     float64
 13  AMT_RECIVABLE                float64
 14  AMT_TOTAL_RECEIVABLE         float64
 15  CNT_DRAWINGS_ATM_CURRENT     float64
 16  CNT_DRAWINGS_CURRENT         int64
 17  CNT_DRAWINGS_OTHER_CURRENT   float64
 18  CNT_DRAWINGS_POS_CURRENT     float64
 19  CNT_INSTALMENT_MATURE_CUM    float64
 20  NAME_CONTRACT_STATUS         object
 21  SK_DPD                       int64
 22  SK_DPD_DEF                   int64
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
ccb.isna().sum()
SK_ID_PREV                         0
SK_ID_CURR                         0
MONTHS_BALANCE                     0
AMT_BALANCE                        0
AMT_CREDIT_LIMIT_ACTUAL            0
AMT_DRAWINGS_ATM_CURRENT      749816
AMT_DRAWINGS_CURRENT               0
AMT_DRAWINGS_OTHER_CURRENT    749816
AMT_DRAWINGS_POS_CURRENT      749816
AMT_INST_MIN_REGULARITY       305236
AMT_PAYMENT_CURRENT           767988
AMT_PAYMENT_TOTAL_CURRENT          0
AMT_RECEIVABLE_PRINCIPAL           0
AMT_RECIVABLE                      0
AMT_TOTAL_RECEIVABLE               0
CNT_DRAWINGS_ATM_CURRENT      749816
CNT_DRAWINGS_CURRENT               0
CNT_DRAWINGS_OTHER_CURRENT    749816
CNT_DRAWINGS_POS_CURRENT      749816
CNT_INSTALMENT_MATURE_CUM     305236
NAME_CONTRACT_STATUS               0
SK_DPD                             0
SK_DPD_DEF                         0
dtype: int64
ccb["SK_ID_PREV"].value_counts(dropna=False).sort_values()
2191610 1
1383311 1
2697963 1
1083843 1
2636347 1
..
1025633 96
2045163 96
1839174 96
1824679 96
2377894 96
Name: SK_ID_PREV, Length: 104307, dtype: int64
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
# x axis: monthly records per SK_ID_PREV; y axis: cumulative number of credits
sns.histplot(ccb['SK_ID_PREV'].value_counts(), element="poly", ax=ax, cumulative=True)
ax.set_ylabel("Cumulative number of previous credits (SK_ID_PREV)")
ax.set_xlabel("Months of credit history per previous credit")
ax.set_title("Cumulative distribution of credit history length")
Text(0.5, 1.0, 'Cumulative distribution of credit history length')
SK_ID_PREV to gain insights
app_train = datasets['application_train']
app_train_req = app_train[["SK_ID_CURR", "TARGET"]]
app_train_ccb = pd.merge(app_train_req, ccb[["SK_ID_CURR","SK_ID_PREV"]], how="left")
app_train_ccb.fillna(0, inplace=True)
app_train_ccb.SK_ID_PREV=app_train_ccb.SK_ID_PREV.astype(int)
display(app_train_ccb[app_train_ccb.TARGET==1].sort_values(by="SK_ID_PREV",ascending=False).head(5))
display(app_train_ccb[app_train_ccb.TARGET==0].sort_values(by="SK_ID_PREV",ascending=False).head(5))
| SK_ID_CURR | TARGET | SK_ID_PREV | |
|---|---|---|---|
| 1067881 | 210644 | 1 | 2843461 |
| 1067868 | 210644 | 1 | 2843461 |
| 1067874 | 210644 | 1 | 2843461 |
| 1067873 | 210644 | 1 | 2843461 |
| 1067872 | 210644 | 1 | 2843461 |
| SK_ID_CURR | TARGET | SK_ID_PREV | |
|---|---|---|---|
| 2295551 | 337804 | 0 | 2843493 |
| 2295555 | 337804 | 0 | 2843493 |
| 2295547 | 337804 | 0 | 2843493 |
| 2295548 | 337804 | 0 | 2843493 |
| 2295552 | 337804 | 0 | 2843493 |
observation_ids = [2843461, 2843493, 2843478]
# show the monthly balance history (most recent month first) for each selected credit
for grp, df in ccb.groupby("SK_ID_PREV"):
    if grp in observation_ids:
        display(pd.merge(df, app_train_req, how="left").sort_values(by="MONTHS_BALANCE", ascending=False))
        observation_ids.remove(grp)
    if len(observation_ids) == 0:
        break
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 2843461 | 210644 | -1 | 44589.735 | 45000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.0 | ... | 44589.735 | 0.0 | 0 | 0.0 | 0.0 | 51.0 | Active | 0 | 0 | 1 |
| 5 | 2843461 | 210644 | -2 | 45904.095 | 45000 | 0.0 | 9243.0 | 0.0 | 9243.0 | 2250.0 | ... | 45904.095 | 0.0 | 1 | 0.0 | 1.0 | 50.0 | Active | 0 | 0 | 1 |
| 31 | 2843461 | 210644 | -3 | 37647.225 | 45000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.0 | ... | 37647.225 | 0.0 | 0 | 0.0 | 0.0 | 49.0 | Active | 0 | 0 | 1 |
| 27 | 2843461 | 210644 | -4 | 39085.695 | 45000 | 0.0 | 1165.5 | 0.0 | 1165.5 | 2250.0 | ... | 39085.695 | 0.0 | 1 | 0.0 | 1.0 | 48.0 | Active | 0 | 0 | 1 |
| 12 | 2843461 | 210644 | -5 | 39776.490 | 45000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.0 | ... | 39776.490 | 0.0 | 0 | 0.0 | 0.0 | 47.0 | Active | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13 | 2843461 | 210644 | -69 | 65340.000 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 65340.000 | 0.0 | 0 | 0.0 | 0.0 | 4.0 | Active | 0 | 0 | 1 |
| 46 | 2843461 | 210644 | -70 | 66995.280 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 66995.280 | 0.0 | 0 | 0.0 | 0.0 | 3.0 | Active | 0 | 0 | 1 |
| 11 | 2843461 | 210644 | -71 | 82152.180 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 82152.180 | 0.0 | 0 | 0.0 | 0.0 | 2.0 | Active | 1 | 1 | 1 |
| 20 | 2843461 | 210644 | -72 | 78989.535 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 78989.535 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | Active | 0 | 0 | 1 |
| 68 | 2843461 | 210644 | -73 | 89685.000 | 90000 | 0.0 | 87660.0 | 0.0 | 87660.0 | NaN | ... | 89685.000 | 0.0 | 2 | 0.0 | 2.0 | NaN | Active | 0 | 0 | 1 |
73 rows × 24 columns
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 2843478 | 424526 | -2 | 0.000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0 |
| 2 | 2843478 | 424526 | -3 | 0.000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0 |
| 32 | 2843478 | 424526 | -4 | 0.000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0 |
| 11 | 2843478 | 424526 | -5 | 0.000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0 |
| 45 | 2843478 | 424526 | -6 | 0.000 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26 | 2843478 | 424526 | -87 | 59249.475 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 59249.475 | 0.0 | 0 | 0.0 | 0.0 | 4.0 | Active | 0 | 0 | 0 |
| 69 | 2843478 | 424526 | -88 | 71238.330 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 71238.330 | 0.0 | 0 | 0.0 | 0.0 | 3.0 | Active | 0 | 0 | 0 |
| 70 | 2843478 | 424526 | -89 | 77646.555 | 90000 | 22500.0 | 22500.0 | 0.0 | 0.0 | 4500.0 | ... | 77646.555 | 1.0 | 1 | 0.0 | 0.0 | 2.0 | Active | 0 | 0 | 0 |
| 47 | 2843478 | 424526 | -90 | 66824.820 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 66824.820 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | Active | 0 | 0 | 0 |
| 5 | 2843478 | 424526 | -91 | 69369.750 | 90000 | 67500.0 | 67500.0 | 0.0 | 0.0 | NaN | ... | 69369.750 | 3.0 | 3 | 0.0 | 0.0 | NaN | Active | 0 | 0 | 0 |
90 rows × 24 columns
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 2843493 | 337804 | -1 | 70445.790 | 225000 | 0.0 | 0.0 | 0.0 | 0.0 | 3674.790 | ... | 69261.345 | 0.0 | 0 | 0.0 | 0.0 | 13.0 | Active | 0 | 0 | 0 |
| 3 | 2843493 | 337804 | -2 | 74304.135 | 225000 | 0.0 | 0.0 | 0.0 | 0.0 | 3748.455 | ... | 73091.025 | 0.0 | 0 | 0.0 | 0.0 | 12.0 | Active | 0 | 0 | 0 |
| 4 | 2843493 | 337804 | -3 | 75970.080 | 225000 | 0.0 | 0.0 | 0.0 | 0.0 | 4043.655 | ... | 74746.170 | 0.0 | 0 | 0.0 | 0.0 | 11.0 | Active | 0 | 0 | 0 |
| 2 | 2843493 | 337804 | -4 | 72921.015 | 225000 | 0.0 | 0.0 | 0.0 | 0.0 | 4358.970 | ... | 72921.015 | 0.0 | 0 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0 |
| 7 | 2843493 | 337804 | -5 | 78839.595 | 225000 | 0.0 | 28998.0 | 0.0 | 28998.0 | 2996.010 | ... | 78839.595 | 0.0 | 1 | 0.0 | 1.0 | 9.0 | Active | 0 | 0 | 0 |
| 6 | 2843493 | 337804 | -6 | 60719.130 | 135000 | 0.0 | 0.0 | 0.0 | 0.0 | 3093.435 | ... | 59606.910 | 0.0 | 0 | 0.0 | 0.0 | 8.0 | Active | 0 | 0 | 0 |
| 1 | 2843493 | 337804 | -7 | 62644.140 | 135000 | 0.0 | 0.0 | 0.0 | 0.0 | 3296.250 | ... | 61517.970 | 0.0 | 0 | 0.0 | 0.0 | 7.0 | Active | 0 | 0 | 0 |
| 13 | 2843493 | 337804 | -8 | 66804.030 | 135000 | 0.0 | 0.0 | 0.0 | 0.0 | 3381.885 | ... | 65647.350 | 0.0 | 0 | 0.0 | 0.0 | 6.0 | Active | 0 | 0 | 0 |
| 0 | 2843493 | 337804 | -9 | 68541.165 | 135000 | 0.0 | 0.0 | 0.0 | 0.0 | 3468.240 | ... | 67371.210 | 0.0 | 0 | 0.0 | 0.0 | 5.0 | Active | 0 | 0 | 0 |
| 8 | 2843493 | 337804 | -10 | 70222.185 | 135000 | 0.0 | 0.0 | 0.0 | 0.0 | 3496.230 | ... | 69047.910 | 0.0 | 0 | 0.0 | 0.0 | 4.0 | Active | 0 | 0 | 0 |
| 12 | 2843493 | 337804 | -11 | 70159.635 | 135000 | 0.0 | 3565.8 | 0.0 | 3565.8 | 2250.000 | ... | 68963.445 | 0.0 | 1 | 0.0 | 1.0 | 3.0 | Active | 0 | 0 | 0 |
| 5 | 2843493 | 337804 | -12 | 64969.380 | 135000 | 0.0 | 47839.5 | 0.0 | 47839.5 | 2250.000 | ... | 64969.380 | 0.0 | 2 | 0.0 | 2.0 | 2.0 | Active | 0 | 0 | 0 |
| 10 | 2843493 | 337804 | -13 | 25036.515 | 90000 | 0.0 | 2969.1 | 0.0 | 2969.1 | 2250.000 | ... | 24903.945 | 0.0 | 1 | 0.0 | 1.0 | 1.0 | Active | 0 | 0 | 0 |
| 11 | 2843493 | 337804 | -14 | 25522.110 | 90000 | 0.0 | 24885.0 | 0.0 | 24885.0 | 0.000 | ... | 24885.000 | 0.0 | 2 | 0.0 | 2.0 | 0.0 | Active | 0 | 0 | 0 |
| 9 | 2843493 | 337804 | -15 | 0.000 | 45000 | NaN | 0.0 | NaN | NaN | 0.000 | ... | 0.000 | NaN | 0 | NaN | NaN | 0.0 | Active | 0 | 0 | 0 |
15 rows × 24 columns
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='NAME_CONTRACT_STATUS', data=ccb,ax=ax)
ax.set_xlabel("Contract Status")
ax.set_ylabel("Counts")
ax.set_title("Contract statuses and their counts for credit card balance")
Text(0.5, 1.0, 'Contract statuses and their counts for credit card balance')
merged_df = pd.merge(ccb, app_train_req, how="left", on="SK_ID_CURR")
# set up the matplotlib figure
f, ax = plt.subplots(1,1, figsize=(25, 25), dpi=500)
# generate a mask that hides the upper triangle (and the diagonal)
mask = np.zeros_like(merged_df.corr(), dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(merged_df.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True);
ax.set_title("Correlation matrix for Credit Card Balance and Target")
Text(0.5, 1.0, 'Correlation matrix for Credit Card Balance and Target')
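The masking step in the heatmap cell can be hard to picture, so here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical stand-ins for `merged_df`; the sketch only shows which cells the mask hides, and it computes the correlation matrix once instead of twice.

```python
import numpy as np
import pandas as pd

# Toy numeric frame standing in for merged_df (hypothetical values).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 3.0, 2.0, 1.0],
})

corr = toy.corr()  # compute once and reuse for both the mask and the plot

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Blank out the hidden cells to see what a masked heatmap would display.
visible = corr.mask(mask)
print(visible)
```

Passing the same `mask` to `sns.heatmap(corr, mask=mask, ...)` leaves only the lower triangle visible, which is what the cell above draws.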
agg_data = ccb.groupby("SK_ID_PREV").agg(['mean','count','sum','min','max'])
agg_data.columns
MultiIndex([( 'SK_ID_CURR', 'mean'),
( 'SK_ID_CURR', 'count'),
( 'SK_ID_CURR', 'sum'),
( 'SK_ID_CURR', 'min'),
( 'SK_ID_CURR', 'max'),
('MONTHS_BALANCE', 'mean'),
('MONTHS_BALANCE', 'count'),
('MONTHS_BALANCE', 'sum'),
('MONTHS_BALANCE', 'min'),
('MONTHS_BALANCE', 'max'),
...
( 'SK_DPD', 'mean'),
( 'SK_DPD', 'count'),
( 'SK_DPD', 'sum'),
( 'SK_DPD', 'min'),
( 'SK_DPD', 'max'),
( 'SK_DPD_DEF', 'mean'),
( 'SK_DPD_DEF', 'count'),
( 'SK_DPD_DEF', 'sum'),
( 'SK_DPD_DEF', 'min'),
( 'SK_DPD_DEF', 'max')],
length=105)
# dir(agg_data.columns)
agg_data.columns.levels
print("-------------MONTHS_BALANCE--------------------")
display(agg_data.head(5)["MONTHS_BALANCE"])
print("-------------AMT_BALANCE--------------------")
display(agg_data.head(5)["AMT_BALANCE"])
print("-------------AMT_CREDIT_LIMIT_ACTUAL--------------------")
display(agg_data.head(5)["AMT_CREDIT_LIMIT_ACTUAL"])
print("-------------AMT_PAYMENT_CURRENT--------------------")
display(agg_data.head(5)["AMT_PAYMENT_CURRENT"])
print("-------------AMT_RECEIVABLE_PRINCIPAL--------------------")
display(agg_data.head(5)["AMT_RECEIVABLE_PRINCIPAL"])
-------------MONTHS_BALANCE--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000018 | -4.0 | 5 | -20 | -6 | -2 |
| 1000030 | -4.5 | 8 | -36 | -8 | -1 |
| 1000031 | -8.5 | 16 | -136 | -16 | -1 |
| 1000035 | -4.0 | 5 | -20 | -6 | -2 |
| 1000077 | -7.0 | 11 | -77 | -12 | -2 |
-------------AMT_BALANCE--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000018 | 74946.285000 | 5 | 374731.425 | 38879.145 | 136695.420 |
| 1000030 | 55991.064375 | 8 | 447928.515 | 0.000 | 103027.275 |
| 1000031 | 52394.439375 | 16 | 838311.030 | 0.000 | 154945.935 |
| 1000035 | 0.000000 | 5 | 0.000 | 0.000 | 0.000 |
| 1000077 | 0.000000 | 11 | 0.000 | 0.000 | 0.000 |
-------------AMT_CREDIT_LIMIT_ACTUAL--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000018 | 81000.000000 | 5 | 405000 | 45000 | 135000 |
| 1000030 | 81562.500000 | 8 | 652500 | 45000 | 135000 |
| 1000031 | 149625.000000 | 16 | 2394000 | 45000 | 225000 |
| 1000035 | 225000.000000 | 5 | 1125000 | 225000 | 225000 |
| 1000077 | 94090.909091 | 11 | 1035000 | 45000 | 135000 |
-------------AMT_PAYMENT_CURRENT--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000018 | 5541.750000 | 5 | 27708.75 | 3190.635 | 9000.00 |
| 1000030 | 6188.631429 | 7 | 43320.42 | 2371.815 | 16067.25 |
| 1000031 | 29543.257500 | 12 | 354519.09 | 394.065 | 160606.80 |
| 1000035 | NaN | 0 | 0.00 | NaN | NaN |
| 1000077 | NaN | 0 | 0.00 | NaN | NaN |
-------------AMT_RECEIVABLE_PRINCIPAL--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000018 | 72298.197000 | 5 | 361490.985 | 37542.645 | 132903.000 |
| 1000030 | 55474.453125 | 8 | 443795.625 | 0.000 | 101866.725 |
| 1000031 | 51402.878437 | 16 | 822446.055 | 0.000 | 154945.935 |
| 1000035 | 0.000000 | 5 | 0.000 | 0.000 | 0.000 |
| 1000077 | 0.000000 | 11 | 0.000 | 0.000 | 0.000 |
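The aggregated frame above carries two-level column labels such as `('AMT_BALANCE', 'mean')`, which are awkward to join back onto the application table. A minimal sketch of one way to flatten them follows; the toy frame and its values are hypothetical, and the `FEATURE_func` naming is an assumption chosen to match the naming used later in this notebook.

```python
import pandas as pd

# Toy stand-in for a monthly balance table (hypothetical values).
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2, 2, 2],
    "AMT_BALANCE": [10.0, 30.0, 0.0, 5.0, 7.0],
    "SK_DPD": [0, 2, 0, 0, 1],
})

agg = toy.groupby("SK_ID_PREV").agg(["mean", "max"])

# Join the two column levels: ("AMT_BALANCE", "mean") -> "AMT_BALANCE_mean".
agg.columns = ["_".join(col) for col in agg.columns]

# Turn the group key back into an ordinary column so it can be merged on.
agg = agg.reset_index()
print(agg)
```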
pcb = datasets["POS_CASH_balance"]
pcb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype
---  ------                 -----
 0   SK_ID_PREV             int64
 1   SK_ID_CURR             int64
 2   MONTHS_BALANCE         int64
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object
 6   SK_DPD                 int64
 7   SK_DPD_DEF             int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
pcb.head(5)
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
pcb.describe()
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|
| count | 1.000136e+07 | 1.000136e+07 | 1.000136e+07 | 9.975287e+06 | 9.975271e+06 | 1.000136e+07 | 1.000136e+07 |
| mean | 1.903217e+06 | 2.784039e+05 | -3.501259e+01 | 1.708965e+01 | 1.048384e+01 | 1.160693e+01 | 6.544684e-01 |
| std | 5.358465e+05 | 1.027637e+05 | 2.606657e+01 | 1.199506e+01 | 1.110906e+01 | 1.327140e+02 | 3.276249e+01 |
| min | 1.000001e+06 | 1.000010e+05 | -9.600000e+01 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434405e+06 | 1.895500e+05 | -5.400000e+01 | 1.000000e+01 | 3.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.896565e+06 | 2.786540e+05 | -2.800000e+01 | 1.200000e+01 | 7.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.368963e+06 | 3.674290e+05 | -1.300000e+01 | 2.400000e+01 | 1.400000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843499e+06 | 4.562550e+05 | -1.000000e+00 | 9.200000e+01 | 8.500000e+01 | 4.231000e+03 | 3.595000e+03 |
app_train_pcb = pd.merge(app_train_req, pcb[["SK_ID_CURR","SK_ID_PREV"]], how="left")
app_train_pcb.fillna(0, inplace=True)
app_train_pcb.SK_ID_PREV=app_train_pcb.SK_ID_PREV.astype(int)
display(app_train_pcb[app_train_pcb.TARGET==1].sort_values(by="SK_ID_PREV",ascending=False).head(15))
display(app_train_pcb[app_train_pcb.TARGET==0].sort_values(by="SK_ID_PREV",ascending=False).head(5))
| SK_ID_CURR | TARGET | SK_ID_PREV | |
|---|---|---|---|
| 3850632 | 260963 | 1 | 2843495 |
| 3850647 | 260963 | 1 | 2843495 |
| 3850638 | 260963 | 1 | 2843495 |
| 3850643 | 260963 | 1 | 2843495 |
| 3850644 | 260963 | 1 | 2843495 |
| 3850645 | 260963 | 1 | 2843495 |
| 3850646 | 260963 | 1 | 2843495 |
| 3850642 | 260963 | 1 | 2843495 |
| 5279022 | 320127 | 1 | 2843481 |
| 5279034 | 320127 | 1 | 2843481 |
| 5279021 | 320127 | 1 | 2843481 |
| 5279020 | 320127 | 1 | 2843481 |
| 5279014 | 320127 | 1 | 2843481 |
| 5279013 | 320127 | 1 | 2843481 |
| 5279027 | 320127 | 1 | 2843481 |
| SK_ID_CURR | TARGET | SK_ID_PREV | |
|---|---|---|---|
| 5135599 | 314148 | 0 | 2843499 |
| 5135574 | 314148 | 0 | 2843499 |
| 5135578 | 314148 | 0 | 2843499 |
| 5135591 | 314148 | 0 | 2843499 |
| 5135592 | 314148 | 0 | 2843499 |
pcb.SK_ID_PREV.value_counts()
1856103 96
2706683 96
1617536 96
1364606 96
1057553 96
..
1922777 1
2660098 1
1364218 1
1077449 1
1191779 1
Name: SK_ID_PREV, Length: 936325, dtype: int64
SK_ID_PREV to gain insights
observation_ids = [2843495, 2843499, 2843481]
for grp, df in pcb.groupby("SK_ID_PREV"):
    if grp in observation_ids:
        display(pd.merge(df, app_train_req, how="left").sort_values(by="MONTHS_BALANCE", ascending=False))
        observation_ids.remove(grp)
    if len(observation_ids) == 0:
        break
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | TARGET | |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 2843481 | 320127 | -78 | 10.0 | 0.0 | Completed | 0 | 0 | 1 |
| 6 | 2843481 | 320127 | -79 | 10.0 | 1.0 | Active | 0 | 0 | 1 |
| 0 | 2843481 | 320127 | -80 | 10.0 | 2.0 | Active | 0 | 0 | 1 |
| 3 | 2843481 | 320127 | -81 | 10.0 | 3.0 | Active | 0 | 0 | 1 |
| 7 | 2843481 | 320127 | -82 | 10.0 | 4.0 | Active | 0 | 0 | 1 |
| 9 | 2843481 | 320127 | -83 | 10.0 | 5.0 | Active | 0 | 0 | 1 |
| 1 | 2843481 | 320127 | -84 | 10.0 | 6.0 | Active | 0 | 0 | 1 |
| 4 | 2843481 | 320127 | -85 | 10.0 | 7.0 | Active | 0 | 0 | 1 |
| 8 | 2843481 | 320127 | -86 | 10.0 | 8.0 | Active | 0 | 0 | 1 |
| 10 | 2843481 | 320127 | -87 | 10.0 | 9.0 | Active | 0 | 0 | 1 |
| 5 | 2843481 | 320127 | -88 | 10.0 | 10.0 | Active | 0 | 0 | 1 |
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | TARGET | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2843495 | 260963 | -9 | 7.0 | 0.0 | Completed | 0 | 0 | 1 |
| 4 | 2843495 | 260963 | -10 | 60.0 | 54.0 | Active | 0 | 0 | 1 |
| 3 | 2843495 | 260963 | -11 | 60.0 | 55.0 | Active | 0 | 0 | 1 |
| 6 | 2843495 | 260963 | -12 | 60.0 | 56.0 | Active | 0 | 0 | 1 |
| 2 | 2843495 | 260963 | -13 | 60.0 | 57.0 | Active | 0 | 0 | 1 |
| 1 | 2843495 | 260963 | -14 | 60.0 | 58.0 | Active | 0 | 0 | 1 |
| 7 | 2843495 | 260963 | -15 | 60.0 | 59.0 | Active | 0 | 0 | 1 |
| 5 | 2843495 | 260963 | -16 | 60.0 | 60.0 | Active | 0 | 0 | 1 |
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | TARGET | |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 2843499 | 314148 | -30 | 10.0 | 0.0 | Completed | 0 | 0 | 0 |
| 5 | 2843499 | 314148 | -31 | 10.0 | 0.0 | Active | 0 | 0 | 0 |
| 1 | 2843499 | 314148 | -32 | 60.0 | 51.0 | Active | 0 | 0 | 0 |
| 2 | 2843499 | 314148 | -33 | 60.0 | 52.0 | Active | 0 | 0 | 0 |
| 7 | 2843499 | 314148 | -34 | 60.0 | 54.0 | Active | 0 | 0 | 0 |
| 0 | 2843499 | 314148 | -35 | 60.0 | 55.0 | Active | 0 | 0 | 0 |
| 10 | 2843499 | 314148 | -36 | 60.0 | 56.0 | Active | 0 | 0 | 0 |
| 6 | 2843499 | 314148 | -37 | 60.0 | 57.0 | Active | 0 | 0 | 0 |
| 9 | 2843499 | 314148 | -38 | 60.0 | 58.0 | Active | 0 | 0 | 0 |
| 4 | 2843499 | 314148 | -39 | 60.0 | 59.0 | Active | 0 | 0 | 0 |
| 3 | 2843499 | 314148 | -40 | 60.0 | 60.0 | Active | 0 | 0 | 0 |
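The loop above walks through every `SK_ID_PREV` group until it has found the requested ids. As a hedged alternative, the same rows can be selected with a single `isin` filter before grouping. The sketch below uses small hypothetical frames (`toy_pcb`, `toy_target`) in place of `pcb` and `app_train_req`.

```python
import pandas as pd

# Hypothetical miniature versions of pcb and app_train_req.
toy_pcb = pd.DataFrame({
    "SK_ID_PREV": [11, 11, 12, 13, 13, 13],
    "SK_ID_CURR": [1, 1, 2, 3, 3, 3],
    "MONTHS_BALANCE": [-2, -1, -5, -3, -1, -2],
})
toy_target = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

observation_ids = [11, 13]

# Filter once with isin, then attach TARGET with a left join.
subset = toy_pcb[toy_pcb["SK_ID_PREV"].isin(observation_ids)]
subset = subset.merge(toy_target, on="SK_ID_CURR", how="left")

# One frame per previous loan, most recent month first.
frames = {
    prev_id: grp.sort_values("MONTHS_BALANCE", ascending=False)
    for prev_id, grp in subset.groupby("SK_ID_PREV")
}
for prev_id, frame in frames.items():
    print(prev_id)
    print(frame)
```

On the full table this groups only the selected rows instead of all of them.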
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
sns.countplot(x='NAME_CONTRACT_STATUS', data=pcb,ax=ax)
ax.set_xlabel("Contract Type")
ax.set_ylabel("Counts")
ax.set_title("Different Types of Contract and their counts for dataset: POS_CASH_BALANCE")
Text(0.5, 1.0, 'Different Types of Contract and their counts for dataset: POS_CASH_BALANCE')
merged_df = pd.merge(pcb, app_train_req, how="left", on="SK_ID_CURR")
# set up the matplotlib figure
f, ax = plt.subplots(1,1, figsize=(25, 25), dpi=500)
# generate a mask that hides the upper triangle (and the diagonal)
mask = np.zeros_like(merged_df.corr(), dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(merged_df.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True);
ax.set_title("Correlation matrix for POS_CASH_BALANCE and Target")
Text(0.5, 1.0, 'Correlation matrix for POS_CASH_BALANCE and Target')
agg_data = pcb.groupby("SK_ID_PREV").agg(['mean','count','sum','min','max'])
agg_data.columns
MultiIndex([( 'SK_ID_CURR', 'mean'),
( 'SK_ID_CURR', 'count'),
( 'SK_ID_CURR', 'sum'),
( 'SK_ID_CURR', 'min'),
( 'SK_ID_CURR', 'max'),
( 'MONTHS_BALANCE', 'mean'),
( 'MONTHS_BALANCE', 'count'),
( 'MONTHS_BALANCE', 'sum'),
( 'MONTHS_BALANCE', 'min'),
( 'MONTHS_BALANCE', 'max'),
( 'CNT_INSTALMENT', 'mean'),
( 'CNT_INSTALMENT', 'count'),
( 'CNT_INSTALMENT', 'sum'),
( 'CNT_INSTALMENT', 'min'),
( 'CNT_INSTALMENT', 'max'),
('CNT_INSTALMENT_FUTURE', 'mean'),
('CNT_INSTALMENT_FUTURE', 'count'),
('CNT_INSTALMENT_FUTURE', 'sum'),
('CNT_INSTALMENT_FUTURE', 'min'),
('CNT_INSTALMENT_FUTURE', 'max'),
( 'SK_DPD', 'mean'),
( 'SK_DPD', 'count'),
( 'SK_DPD', 'sum'),
( 'SK_DPD', 'min'),
( 'SK_DPD', 'max'),
( 'SK_DPD_DEF', 'mean'),
( 'SK_DPD_DEF', 'count'),
( 'SK_DPD_DEF', 'sum'),
( 'SK_DPD_DEF', 'min'),
( 'SK_DPD_DEF', 'max')],
)
print("-------------MONTHS_BALANCE--------------------")
display(agg_data.head(5)["MONTHS_BALANCE"])
print("-------------CNT_INSTALMENT--------------------")
display(agg_data.head(5)["CNT_INSTALMENT"])
print("-------------CNT_INSTALMENT_FUTURE--------------------")
display(agg_data.head(5)["CNT_INSTALMENT_FUTURE"])
print("-------------SK_DPD--------------------")
display(agg_data.head(5)["SK_DPD"])
-------------MONTHS_BALANCE--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000001 | -9.0 | 3 | -27 | -10 | -8 |
| 1000002 | -52.0 | 5 | -260 | -54 | -50 |
| 1000003 | -2.5 | 4 | -10 | -4 | -1 |
| 1000004 | -25.5 | 8 | -204 | -29 | -22 |
| 1000005 | -51.0 | 11 | -561 | -56 | -46 |
-------------CNT_INSTALMENT--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000001 | 8.666667 | 3 | 26.0 | 2.0 | 12.0 |
| 1000002 | 5.200000 | 5 | 26.0 | 4.0 | 6.0 |
| 1000003 | 12.000000 | 4 | 48.0 | 12.0 | 12.0 |
| 1000004 | 9.625000 | 8 | 77.0 | 7.0 | 10.0 |
| 1000005 | 10.000000 | 11 | 110.0 | 10.0 | 10.0 |
-------------CNT_INSTALMENT_FUTURE--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000001 | 7.666667 | 3 | 23.0 | 0.0 | 12.0 |
| 1000002 | 2.000000 | 5 | 10.0 | 0.0 | 4.0 |
| 1000003 | 10.500000 | 4 | 42.0 | 9.0 | 12.0 |
| 1000004 | 6.125000 | 8 | 49.0 | 0.0 | 10.0 |
| 1000005 | 5.000000 | 11 | 55.0 | 0.0 | 10.0 |
-------------SK_DPD--------------------
| mean | count | sum | min | max | |
|---|---|---|---|---|---|
| SK_ID_PREV | |||||
| 1000001 | 0.0 | 3 | 0 | 0 | 0 |
| 1000002 | 0.0 | 5 | 0 | 0 | 0 |
| 1000003 | 0.0 | 4 | 0 | 0 | 0 |
| 1000004 | 0.0 | 8 | 0 | 0 | 0 |
| 1000005 | 0.0 | 11 | 0 | 0 | 0 |
ip = datasets['installments_payments']
ip.head(5)
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
ip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype
---  ------                  -----
 0   SK_ID_PREV              int64
 1   SK_ID_CURR              int64
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
ip.describe()
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| count | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360250e+07 | 1.360540e+07 | 1.360250e+07 |
| mean | 1.903365e+06 | 2.784449e+05 | 8.566373e-01 | 1.887090e+01 | -1.042270e+03 | -1.051114e+03 | 1.705091e+04 | 1.723822e+04 |
| std | 5.362029e+05 | 1.027183e+05 | 1.035216e+00 | 2.666407e+01 | 8.009463e+02 | 8.005859e+02 | 5.057025e+04 | 5.473578e+04 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 1.000000e+00 | -2.922000e+03 | -4.921000e+03 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434191e+06 | 1.896390e+05 | 0.000000e+00 | 4.000000e+00 | -1.654000e+03 | -1.662000e+03 | 4.226085e+03 | 3.398265e+03 |
| 50% | 1.896520e+06 | 2.786850e+05 | 1.000000e+00 | 8.000000e+00 | -8.180000e+02 | -8.270000e+02 | 8.884080e+03 | 8.125515e+03 |
| 75% | 2.369094e+06 | 3.675300e+05 | 1.000000e+00 | 1.900000e+01 | -3.610000e+02 | -3.700000e+02 | 1.671021e+04 | 1.610842e+04 |
| max | 2.843499e+06 | 4.562550e+05 | 1.780000e+02 | 2.770000e+02 | -1.000000e+00 | -1.000000e+00 | 3.771488e+06 | 3.771488e+06 |
ip.isna().sum()
SK_ID_PREV                   0
SK_ID_CURR                   0
NUM_INSTALMENT_VERSION       0
NUM_INSTALMENT_NUMBER        0
DAYS_INSTALMENT              0
DAYS_ENTRY_PAYMENT        2905
AMT_INSTALMENT               0
AMT_PAYMENT               2905
dtype: int64
app_train = datasets['application_train']
app_train_req = app_train[["SK_ID_CURR", "TARGET"]]
app_train_ip = pd.merge(app_train_req, ip[["SK_ID_CURR","SK_ID_PREV"]], how="left")
app_train_ip.fillna(0, inplace=True)
app_train_ip.SK_ID_PREV=app_train_ip.SK_ID_PREV.astype(int)
display(app_train_ip[app_train_ip.TARGET==1].sort_values(by="SK_ID_PREV",ascending=False).head(5))
display(app_train_ip[app_train_ip.TARGET==0].sort_values(by="SK_ID_PREV",ascending=False).head(5))
| SK_ID_CURR | TARGET | SK_ID_PREV | |
|---|---|---|---|
| 5219784 | 260963 | 1 | 2843495 |
| 5219782 | 260963 | 1 | 2843495 |
| 5219789 | 260963 | 1 | 2843495 |
| 5219788 | 260963 | 1 | 2843495 |
| 5219779 | 260963 | 1 | 2843495 |
| SK_ID_CURR | TARGET | SK_ID_PREV | |
|---|---|---|---|
| 6957373 | 314148 | 0 | 2843499 |
| 6957375 | 314148 | 0 | 2843499 |
| 6957389 | 314148 | 0 | 2843499 |
| 6957388 | 314148 | 0 | 2843499 |
| 6957382 | 314148 | 0 | 2843499 |
SK_ID_PREV to gain insights
observation_ids = [2843499, 2843495, 2843461]
for grp, df in ip.groupby("SK_ID_PREV"):
    if grp in observation_ids:
        display(pd.merge(df, app_train_req, how="left").sort_values(by="NUM_INSTALMENT_NUMBER"))
        observation_ids.remove(grp)
    if len(observation_ids) == 0:
        break
merged_df = pd.merge(ip, app_train_req, how="left", on="SK_ID_CURR")
# set up the matplotlib figure
f, ax = plt.subplots(1,1, figsize=(25, 25))
# generate a mask that hides the upper triangle (and the diagonal)
mask = np.zeros_like(merged_df.corr(), dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(merged_df.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True);
ax.set_title("Correlation matrix for Installment payments and Target")
ip.NUM_INSTALMENT_VERSION.unique()
agg_data = ip.groupby("SK_ID_PREV").agg(['mean','count','sum','min','max'])
agg_data.columns
agg_data.head(5)["AMT_INSTALMENT"]
agg_data.head(5)["AMT_PAYMENT"]
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(48744, 121)
datasets["application_train"].shape
(160892, 122)
Most persons in the Kaggle submission file have previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.
appsDF = datasets["previous_application"]
appsDF.shape
(1670214, 37)
len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"]))
47800
print(f"There are {appsDF.shape[0]:,} previous applications")
There are 1,670,214 previous applications
# How many previous applications are there for each applicant (SK_ID_CURR)?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts >40]) # more than 40 previous applications
101
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
prevAppCounts[prevAppCounts >50].plot(kind='barh', ax=ax)
ax.set_title("Applicants with previous application count of more than 50")
Text(0.5, 1.0, 'Applicants with previous application count of more than 50')
sum(appsDF['SK_ID_CURR'].value_counts()==1)
60458
fig, ax = plt.subplots(1,1, figsize=(10,10), dpi=400)
ax.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative=True, bins=100, histtype="barstacked")
ax.set_ylabel('cumulative number of IDs')
ax.set_xlabel('Number of previous applications per ID')
ax.set_title('Histogram of Number of previous applications for an ID')
Text(0.5, 1.0, 'Histogram of Number of previous applications for an ID')
* Low = <5 claims (22%)
* Medium = 10 to 39 claims (58%)
* High = 40 or more claims (20%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()<=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or less previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 5 or less previous apps: 67.34581
Percentage with 40 or more previous apps: 0.03453
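The percentages above come from summing a boolean Series and dividing by the number of distinct ids. A minimal sketch of the same calculation on toy data follows; the ids and the thresholds (3 and 7) are hypothetical and chosen only so that the toy result is easy to check by hand. It relies on the fact that the mean of a boolean Series is the share of `True` values.

```python
import pandas as pd

# Hypothetical list of SK_ID_CURR values, one entry per previous application.
toy_ids = pd.Series([1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4], name="SK_ID_CURR")

# Number of previous applications per applicant.
counts = toy_ids.value_counts()

# Share of applicants at or below / at or above a threshold, in percent.
pct_le_3 = 100.0 * (counts <= 3).mean()
pct_ge_7 = 100.0 * (counts >= 7).mean()

print("Percentage with 3 or fewer previous apps:", pct_le_3)
print("Percentage with 7 or more previous apps:", pct_ge_7)
```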
In the HCDR competition (and in many other machine learning problems that involve multiple tables, whether in 3NF or not), we need to join these datasets (denormalize) before using a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these tend to be aggregate features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
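previous_application with application_x
We refer to the application_train data (and the application_test data) as the primary table and to the other files as the secondary tables (e.g., the previous_application dataset). previous_application joins to the primary table on SK_ID_CURR; SK_ID_PREV identifies an individual previous application and links previous_application to POS_CASH_balance, credit_card_balance, and installments_payments.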
Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:
AMT_APPLICATION, AMT_CREDIT could be based on average, min, max, median, etc.
To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).
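As a rough illustration of the idea, the sketch below aggregates a secondary table to one row per `SK_ID_CURR` and then left-joins it onto the primary table. The frames `toy_app` and `toy_prev` and the feature name `prev_app_count` are hypothetical; only the column names `SK_ID_CURR`, `SK_ID_PREV`, `TARGET`, and `AMT_APPLICATION` come from the datasets shown in this notebook.

```python
import pandas as pd

# Hypothetical miniature primary and secondary tables.
toy_app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_prev = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12],
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [100.0, 300.0, 50.0],
})

# Step 1: collapse the secondary table to one row per applicant.
grouped = toy_prev.groupby("SK_ID_CURR")
prev_features = pd.DataFrame({
    "prev_app_count": grouped.size(),
    "AMT_APPLICATION_mean": grouped["AMT_APPLICATION"].mean(),
}).reset_index()

# Step 2: left join so applicants without previous applications are kept.
joined = toy_app.merge(prev_features, on="SK_ID_CURR", how="left")
print(joined)
```

Applicants with no previous applications end up with missing values in the new columns, so an imputation step is still needed downstream.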
When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:
application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]
I want you to think about this section and build on this.
application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset)), thereby leading to X_train, y_train, X_valid, etc.
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])
df
| A | B | C | |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | 9.0 |
| 3 | NaN | NaN | NaN |
df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
# A B
#max NaN 8.0
#min 1.0 2.0
#sum 12.0 NaN
| A | B | |
|---|---|---|
| sum | 12.0 | NaN |
| min | 1.0 | 2.0 |
| max | NaN | 8.0 |
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': np.random.randn(4)})
df
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 1 | 1.659589 |
| 1 | 1 | 2 | 0.377106 |
| 2 | 2 | 3 | -1.816287 |
| 3 | 2 | 4 | -0.390450 |
df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
# B C
# min max sum
#A
#1 1 2 0.590716
#2 3 4 0.704907
| B | C | ||
|---|---|---|---|
| min | max | sum | |
| A | |||
| 1 | 1 | 2 | 2.036695 |
| 2 | 3 | 4 | -2.206736 |
funcs = ["a","b","c"]
{f:f"{f}_max" for f in funcs}
{'a': 'a_max', 'b': 'b_max', 'c': 'c_max'}
So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using the three logical operators and, or, and not.
Use &, |, and ~. Although Python uses the syntax and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.
You must use the following operators with pandas: & for and, | for or, ~ for not. Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as ==.
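A minimal sketch of combining conditions with these operators, on a hypothetical three-row frame that reuses the column names `SK_ID_CURR` and `AMT_CREDIT`:

```python
import pandas as pd

# Hypothetical rows; only the column names come from previous_application.
toy = pd.DataFrame({
    "SK_ID_CURR": [175704, 175704, 100002],
    "AMT_CREDIT": [0.0, 1.0, 5.0],
})

# Build each condition as its own boolean Series.
is_client = toy["SK_ID_CURR"] == 175704
is_unit_credit = toy["AMT_CREDIT"] == 1.0

both = toy[is_client & is_unit_credit]               # and
either = toy[is_client | is_unit_credit]             # or
client_not_unit = toy[is_client & ~is_unit_credit]   # and not

print(both)
print(either)
print(client_not_unit)
```

Naming the conditions first keeps the parentheses out of the way and makes the final selection easier to read.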
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 37 columns
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 37 columns
appsDF.isna().sum()
SK_ID_PREV                           0
SK_ID_CURR                           0
NAME_CONTRACT_TYPE                   0
AMT_ANNUITY                     372235
AMT_APPLICATION                      0
AMT_CREDIT                           1
AMT_DOWN_PAYMENT                895844
AMT_GOODS_PRICE                 385515
WEEKDAY_APPR_PROCESS_START           0
HOUR_APPR_PROCESS_START              0
FLAG_LAST_APPL_PER_CONTRACT          0
NFLAG_LAST_APPL_IN_DAY               0
RATE_DOWN_PAYMENT               895844
RATE_INTEREST_PRIMARY          1664263
RATE_INTEREST_PRIVILEGED       1664263
NAME_CASH_LOAN_PURPOSE               0
NAME_CONTRACT_STATUS                 0
DAYS_DECISION                        0
NAME_PAYMENT_TYPE                    0
CODE_REJECT_REASON                   0
NAME_TYPE_SUITE                 820405
NAME_CLIENT_TYPE                     0
NAME_GOODS_CATEGORY                  0
NAME_PORTFOLIO                       0
NAME_PRODUCT_TYPE                    0
CHANNEL_TYPE                         0
SELLERPLACE_AREA                     0
NAME_SELLER_INDUSTRY                 0
CNT_PAYMENT                     372230
NAME_YIELD_GROUP                     0
PRODUCT_COMBINATION                346
DAYS_FIRST_DRAWING              673065
DAYS_FIRST_DUE                  673065
DAYS_LAST_DUE_1ST_VERSION       673065
DAYS_LAST_DUE                   673065
DAYS_TERMINATION                673065
NFLAG_INSURED_ON_APPROVAL       673065
dtype: int64
appsDF.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
Previous_application analysis
appsDF
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1670209 | 2300464 | 352015 | Consumer loans | 14704.290 | 267295.5 | 311400.0 | 0.0 | 267295.5 | WEDNESDAY | 12 | ... | Furniture | 30.0 | low_normal | POS industry with interest | 365243.0 | -508.0 | 362.0 | -358.0 | -351.0 | 0.0 |
| 1670210 | 2357031 | 334635 | Consumer loans | 6622.020 | 87750.0 | 64291.5 | 29250.0 | 87750.0 | TUESDAY | 15 | ... | Furniture | 12.0 | middle | POS industry with interest | 365243.0 | -1604.0 | -1274.0 | -1304.0 | -1297.0 | 0.0 |
| 1670211 | 2659632 | 249544 | Consumer loans | 11520.855 | 105237.0 | 102523.5 | 10525.5 | 105237.0 | MONDAY | 12 | ... | Consumer electronics | 10.0 | low_normal | POS household with interest | 365243.0 | -1457.0 | -1187.0 | -1187.0 | -1181.0 | 0.0 |
| 1670212 | 2785582 | 400317 | Cash loans | 18821.520 | 180000.0 | 191880.0 | NaN | 180000.0 | WEDNESDAY | 9 | ... | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -1155.0 | -825.0 | -825.0 | -817.0 | 1.0 |
| 1670213 | 2418762 | 261212 | Cash loans | 16431.300 | 360000.0 | 360000.0 | NaN | 360000.0 | SUNDAY | 10 | ... | XNA | 48.0 | middle | Cash X-Sell: middle | 365243.0 | -1163.0 | 247.0 | -443.0 | -423.0 | 0.0 |
1670214 rows × 37 columns
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
agg_op_features = {}
for f in features: #build agg dictionary
    agg_op_features[f]=[]
    agg_op_features[f].extend((f"{f}_{func}",func) for func in ["min", "max", "mean"])
print(f"{appsDF[features].describe()}")
print("\n\n\n Required Features...")
print(agg_op_features)
result = appsDF.groupby(["SK_ID_CURR"]).agg(agg_op_features)
result.columns = result.columns.droplevel() #drop 1 of the header row but keep the feature name header row
result = result.reset_index(level=["SK_ID_CURR"])
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"---------------------\n\n\n result.shape: {result.shape}")
display(result.head(10))
AMT_ANNUITY AMT_APPLICATION
count 1.297979e+06 1.670214e+06
mean 1.595512e+04 1.752339e+05
std 1.478214e+04 2.927798e+05
min 0.000000e+00 0.000000e+00
25% 6.321780e+03 1.872000e+04
50% 1.125000e+04 7.104600e+04
75% 2.065842e+04 1.803600e+05
max 4.180581e+05 6.905160e+06
Required Features...
{'AMT_ANNUITY': [('AMT_ANNUITY_min', 'min'), ('AMT_ANNUITY_max', 'max'), ('AMT_ANNUITY_mean', 'mean')], 'AMT_APPLICATION': [('AMT_APPLICATION_min', 'min'), ('AMT_APPLICATION_max', 'max'), ('AMT_APPLICATION_mean', 'mean')]}
---------------------
result.shape: (338857, 8)
| SK_ID_CURR | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_ANNUITY_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | AMT_APPLICATION_mean | range_AMT_APPLICATION | |
|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000000 | 24835.5 | 24835.5 | 24835.500000 | 0.0 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775000 | 179055.0 | 179055.0 | 179055.000000 | 0.0 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990000 | 68809.5 | 900000.0 | 435436.500000 | 831190.5 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250000 | 24282.0 | 24282.0 | 24282.000000 | 0.0 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200000 | 0.0 | 44617.5 | 22308.750000 | 44617.5 |
| 5 | 100006 | 2482.920 | 39954.510 | 23651.175000 | 0.0 | 688500.0 | 272203.260000 | 688500.0 |
| 6 | 100007 | 1834.290 | 22678.785 | 12278.805000 | 17176.5 | 247500.0 | 150530.250000 | 230323.5 |
| 7 | 100008 | 8019.090 | 25309.575 | 15839.696250 | 0.0 | 450000.0 | 155701.800000 | 450000.0 |
| 8 | 100009 | 7435.845 | 17341.605 | 10051.412143 | 40455.0 | 110160.0 | 76741.714286 | 69705.0 |
| 9 | 100010 | 27463.410 | 27463.410 | 27463.410000 | 247212.0 | 247212.0 | 247212.000000 | 0.0 |
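To make the mechanics of the aggregation cell above easier to follow, here is a minimal sketch on a made-up three-row frame. The frame `toy` and its values are illustrative assumptions, not HCDR data. It shows how a list of `(name, func)` tuples produces a two-level column index, which `droplevel()` then flattens before the range feature is derived.

```python
import pandas as pd

# Hypothetical toy data: two previous applications for client 1, one for client 2
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [100.0, 300.0, 50.0],
})

# Same dictionary shape as agg_op_features: feature -> [(output_name, func), ...]
agg_spec = {
    "AMT_APPLICATION": [(f"AMT_APPLICATION_{fn}", fn) for fn in ["min", "max", "mean"]]
}

toy_result = toy.groupby("SK_ID_CURR").agg(agg_spec)
# Columns are now a two-level index: (feature, output_name); keep only output_name
toy_result.columns = toy_result.columns.droplevel()
toy_result = toy_result.reset_index()
toy_result["range_AMT_APPLICATION"] = (
    toy_result["AMT_APPLICATION_max"] - toy_result["AMT_APPLICATION_min"]
)
print(toy_result)
```

A client with a single previous application (client 2 here) always gets a range of 0, which matches the many 0.0 entries in the table above.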
agg_op_features
{'AMT_ANNUITY': [('AMT_ANNUITY_min', 'min'),
('AMT_ANNUITY_max', 'max'),
('AMT_ANNUITY_mean', 'mean')],
'AMT_APPLICATION': [('AMT_APPLICATION_min', 'min'),
('AMT_APPLICATION_max', 'max'),
('AMT_APPLICATION_mean', 'mean')]}
result.isna().sum()
SK_ID_CURR                 0
AMT_ANNUITY_min          480
AMT_ANNUITY_max          480
AMT_ANNUITY_mean         480
AMT_APPLICATION_min        0
AMT_APPLICATION_max        0
AMT_APPLICATION_mean       0
range_AMT_APPLICATION      0
dtype: int64
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None, prevApp=1):  # no *args or **kwargs
        self.prevApp = prevApp
        self.features = features
        # map each feature to a list of (output_column_name, aggregation_function) pairs
        self.agg_op_features = {}
        for f in (features or []):  # tolerate the default features=None
            self.agg_op_features[f] = []
            self.agg_op_features[f].extend((f"{f}_{func}", func) for func in ["min", "max", "mean"])

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        ####################-- Python Debugging---################################
        # from IPython.core.debugger import Pdb as pdb
        # pdb().set_trace()
        # breakpoint: don't forget to quit
        ###########################################################
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        result.columns = result.columns.droplevel()  # keep only the feature-name header row
        result = result.reset_index(level=["SK_ID_CURR"])
        if self.prevApp:  # the range feature only applies to previous_application
            result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
        return result
    # todo ---
    # return dataframe with the join key "SK_ID_CURR"
def test_driver_prevAppsFeaturesAggregater(df, features):
    print("Executing the test driver............")
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n")
    display(df[features].head(5))
    print("---- Testing with `make_pipeline`---------")
    test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
    return test_pipeline.fit_transform(df)
# All features of previous applications .....
features = ['AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CNT_PAYMENT',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# Features of interest.....
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print("\n\n----- Results ----------")
print(f"Test driver: \n")
display(res.head(10))
print(f"input[features][0:10]: \n")
display(appsDF.head(10))
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
Executing the test driver............
df.shape: (1670214, 37)

df[['AMT_ANNUITY', 'AMT_APPLICATION']][0:5]:
| AMT_ANNUITY | AMT_APPLICATION | |
|---|---|---|
| 0 | 1730.430 | 17145.0 |
| 1 | 25188.615 | 607500.0 |
| 2 | 15060.735 | 112500.0 |
| 3 | 47041.335 | 450000.0 |
| 4 | 31924.395 | 337500.0 |
---- Testing with `make_pipeline`---------

----- Results ----------
Test driver:
| SK_ID_CURR | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_ANNUITY_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | AMT_APPLICATION_mean | range_AMT_APPLICATION | |
|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000000 | 24835.5 | 24835.5 | 24835.500000 | 0.0 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775000 | 179055.0 | 179055.0 | 179055.000000 | 0.0 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990000 | 68809.5 | 900000.0 | 435436.500000 | 831190.5 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250000 | 24282.0 | 24282.0 | 24282.000000 | 0.0 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200000 | 0.0 | 44617.5 | 22308.750000 | 44617.5 |
| 5 | 100006 | 2482.920 | 39954.510 | 23651.175000 | 0.0 | 688500.0 | 272203.260000 | 688500.0 |
| 6 | 100007 | 1834.290 | 22678.785 | 12278.805000 | 17176.5 | 247500.0 | 150530.250000 | 230323.5 |
| 7 | 100008 | 8019.090 | 25309.575 | 15839.696250 | 0.0 | 450000.0 | 155701.800000 | 450000.0 |
| 8 | 100009 | 7435.845 | 17341.605 | 10051.412143 | 40455.0 | 110160.0 | 76741.714286 | 69705.0 |
| 9 | 100010 | 27463.410 | 27463.410 | 27463.410000 | 247212.0 | 247212.0 | 247212.000000 | 0.0 |
input[features][0:10]:
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 1383531 | 199383 | Cash loans | 23703.930 | 315000.0 | 340573.5 | NaN | 315000.0 | SATURDAY | 8 | ... | XNA | 18.0 | low_normal | Cash X-Sell: low | 365243.0 | -654.0 | -144.0 | -144.0 | -137.0 | 1.0 |
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 1656711 | 296299 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | MONDAY | 7 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 2367563 | 342292 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | MONDAY | 15 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 2579447 | 334349 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | SATURDAY | 15 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 37 columns
~3==3
False
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
bureau_features = ['AMT_ANNUITY', 'AMT_CREDIT_SUM']
bb_features = ['MONTHS_BALANCE']
ccb_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM']
ip_features = ['AMT_INSTALMENT', 'AMT_PAYMENT']
prevApps_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('prevApps_aggregater', prevAppsFeaturesAggregater(features)), # Aggregate across old and new features
])
bureau_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('feature_aggregater', prevAppsFeaturesAggregater(bureau_features,prevApp=0)), # Aggregate across old and new features
])
bb_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('feature_aggregater', prevAppsFeaturesAggregater(bb_features,prevApp=0)), # Aggregate across old and new features
])
ccb_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('feature_aggregater', prevAppsFeaturesAggregater(ccb_features,prevApp=0)), # Aggregate across old and new features
])
ip_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('feature_aggregater', prevAppsFeaturesAggregater(ip_features,prevApp=0)), # Aggregate across old and new features
])
X_train= datasets["application_train"] #primary dataset
appsDF = datasets["previous_application"] #prev app
merge_all_data = True
# transform all the secondary tables
# 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
# 'previous_application', 'POS_CASH_balance'
bureauDF = datasets['bureau']
bbDF = datasets['bureau_balance']
ccbDF = datasets['credit_card_balance']
ipDF = datasets['installments_payments']
posDF = datasets['POS_CASH_balance']
if merge_all_data:
    prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
    bureau_aggregated = bureau_feature_pipeline.transform(bureauDF)
    # bb_aggregated = bb_feature_pipeline.transform(bbDF)
    ccb_aggregated = ccb_feature_pipeline.transform(ccbDF)
    ip_aggregated = ip_feature_pipeline.transform(ipDF)
    # pos_aggregated = prevApps_feature_pipeline.transform(posDF)
#'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
# 'previous_application', 'POS_CASH_balance'
# merge primary table and secondary tables using features based on meta data and aggregate stats
if merge_all_data:
    # 1. Join/Merge in prevApps Data
    X_train = X_train.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
    # 2. Join/Merge in ...... Data
    X_train = X_train.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
    # 3. Join/Merge in .....Data
    # (assign back to X_train; the recorded run assigned to `dX_train` here, so the
    # ccb columns are missing from the output shown below)
    X_train = X_train.merge(ccb_aggregated, how='left', on="SK_ID_CURR")
    # 4. Join/Merge in Aggregated ...... Data
    X_train = X_train.merge(ip_aggregated, how='left', on="SK_ID_CURR")
print(X_train.shape)
display(X_train)
(307511, 141)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_ANNUITY_mean_y | AMT_CREDIT_SUM_min | AMT_CREDIT_SUM_max | AMT_CREDIT_SUM_mean | AMT_INSTALMENT_min | AMT_INSTALMENT_max | AMT_INSTALMENT_mean | AMT_PAYMENT_min | AMT_PAYMENT_max | AMT_PAYMENT_mean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 450000.0 | 108131.945625 | 9251.775 | 53093.745 | 11559.247105 | 9251.775 | 53093.745 | 11559.247105 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | NaN | 22248.0 | 810000.0 | 254350.125000 | 6662.970 | 560835.360 | 64754.586000 | 6662.970 | 560835.360 | 64754.586000 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | NaN | 94500.0 | 94537.8 | 94518.900000 | 5357.250 | 10573.965 | 7096.155000 | 5357.250 | 10573.965 | 7096.155000 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | NaN | NaN | 2482.920 | 691786.890 | 62947.088438 | 2482.920 | 691786.890 | 62947.088438 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | NaN | 146250.0 | 146250.0 | 146250.000000 | 1821.780 | 22678.785 | 12666.444545 | 0.180 | 22678.785 | 12214.060227 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | NaN | NaN | NaN | NaN | 6605.910 | 12815.010 | 7492.924286 | 6605.910 | 12815.010 | 7492.924286 |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | NaN | NaN | NaN | NaN | 10046.880 | 10074.465 | 10069.867500 | 10046.880 | 10074.465 | 10069.867500 |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | 58369.5 | 360000.0 | 2250000.0 | 990000.000000 | 2754.450 | 5575.185 | 4399.707857 | 27.270 | 5575.185 | 4115.915357 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0.0 | 45000.0 | 45000.0 | 45000.000000 | 2296.440 | 19065.825 | 10239.832895 | 2296.440 | 19065.825 | 10239.832895 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 1081.5 | 22995.0 | 900000.0 | 345629.045455 | 11090.835 | 615229.515 | 41464.713649 | 34.965 | 669251.655 | 47646.215878 |
307511 rows × 141 columns
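The left joins above keep every row of the primary table and leave NaN where an applicant has no rows in a secondary table. The `AMT_ANNUITY_mean_y` column in the output also hints that several tables produce identically named columns, which pandas disambiguates with `_x`/`_y` suffixes. The sketch below, on made-up frames, shows one possible way to keep names readable by prefixing each aggregated table before merging. The helper `prefix_features` and the prefixes `prev`/`bureau` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy tables standing in for the primary and aggregated secondary tables
primary = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev_agg = pd.DataFrame({"SK_ID_CURR": [1, 3], "AMT_ANNUITY_mean": [10.0, 30.0]})
bureau_agg = pd.DataFrame({"SK_ID_CURR": [1], "AMT_ANNUITY_mean": [5.0]})

def prefix_features(df, prefix, key="SK_ID_CURR"):
    """Prefix every column except the join key (illustrative helper)."""
    return df.rename(columns={c: f"{prefix}_{c}" for c in df.columns if c != key})

merged = primary
for table, prefix in [(prev_agg, "prev"), (bureau_agg, "bureau")]:
    # validate="one_to_one" raises if either side has duplicate keys,
    # which would otherwise silently multiply rows in the primary table
    merged = merged.merge(
        prefix_features(table, prefix), how="left", on="SK_ID_CURR", validate="one_to_one"
    )

print(merged)
```

Applicants 2 and 3 end up with NaN in the columns of tables where they have no history, while the row count stays at three.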
X_kaggle_test= datasets["application_test"]
if merge_all_data:
    # 1. Join/Merge in prevApps Data
    X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
    # 2. Join/Merge in ...... Data
    X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on='SK_ID_CURR')
    # 3. Join/Merge in .....Data
    X_kaggle_test = X_kaggle_test.merge(ccb_aggregated, how='left', on='SK_ID_CURR')
    # 4. Join/Merge in Aggregated ...... Data
    X_kaggle_test = X_kaggle_test.merge(ip_aggregated, how='left', on='SK_ID_CURR')
print(X_kaggle_test.shape)
display(X_kaggle_test)
(48744, 149)
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean | AMT_INSTALMENT_min | AMT_INSTALMENT_max | AMT_INSTALMENT_mean | AMT_PAYMENT_min | AMT_PAYMENT_max | AMT_PAYMENT_mean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | NaN | NaN | NaN | NaN | 3951.000 | 17397.900 | 5885.132143 | 3951.000 | 17397.900 | 5885.132143 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | NaN | NaN | NaN | NaN | 4813.200 | 17656.245 | 6240.205000 | 4813.200 | 17656.245 | 6240.205000 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 18159.919219 | 1.0 | 22.0 | 18.719101 | 67.500 | 357347.745 | 10897.898516 | 6.165 | 357347.745 | 9740.235774 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 8085.058163 | 1.0 | 35.0 | 19.547619 | 1.170 | 38988.540 | 4979.282257 | 1.170 | 38988.540 | 4356.731549 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | NaN | NaN | NaN | NaN | 11097.450 | 11100.600 | 11100.337500 | 11097.450 | 11100.600 | 11100.337500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | NaN | NaN | NaN | NaN | 14222.430 | 244664.505 | 91036.455000 | 14222.430 | 244664.505 | 91036.455000 |
| 48740 | 456222 | Cash loans | F | N | N | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | NaN | NaN | NaN | NaN | 3653.955 | 14571.765 | 8086.162192 | 2.700 | 14571.765 | 7771.447603 |
| 48741 | 456223 | Cash loans | F | Y | Y | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | NaN | NaN | NaN | NaN | 12640.950 | 81184.005 | 23158.991250 | 12640.950 | 81184.005 | 23158.991250 |
| 48742 | 456224 | Cash loans | M | N | N | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | NaN | NaN | NaN | NaN | 5519.925 | 23451.705 | 17269.234138 | 5519.925 | 23451.705 | 17269.234138 |
| 48743 | 456250 | Cash loans | F | Y | N | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | ... | 173589.326250 | 0.0 | 10.0 | 4.583333 | 1.080 | 26474.625 | 13238.063100 | 1.080 | 26474.625 | 13044.983400 |
48744 rows × 149 columns
# approval rate 'NFLAG_INSURED_ON_APPROVAL'
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
                  '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
Train, validation and test sets (and the leakage problem mentioned previously):
Let's look at a small use case that shows how to deal with this:
Transforming a validation or test set with a OneHotEncoder that was fitted only on the training set can raise a ValueError. This happens because there are new, previously unseen unique values in the test set and the encoder doesn't know how to handle them. In order to use both the transformed training and test sets in machine learning algorithms, they need to have the same number of columns. This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.
Here is an example in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in OHE, which ignores values from the validation/test sets that
# do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
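The effect of `handle_unknown` can be checked in isolation on a toy example. The category values below are made up for illustration; only `OneHotEncoder` and its `handle_unknown` option come from the text above. The encoder's default sparse output is converted with `.toarray()` so the sketch does not depend on the `sparse`/`sparse_output` argument name, which differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values; "Secretaries" never appears in the training data
train = np.array([["Laborers"], ["Drivers"], ["Laborers"]], dtype=object)
test = np.array([["Drivers"], ["Secretaries"]], dtype=object)

# Default behaviour (handle_unknown="error"): unseen categories raise ValueError
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore": the unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()

print(raised)
print(lenient.categories_)
print(encoded)
```

The encoded test set keeps the two columns learned from the training set, so both matrices can be fed to the same model.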
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(return_X_y=False)
X, y = load_breast_cancer(return_X_y=True)
print(y[[10, 50, 85]])
#([0, 1, 0])
list(data.target_names)
#['malignant', 'benign']
X.shape
[0 1 0]
(569, 30)
data.feature_names
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error',
'fractal dimension error', 'worst radius', 'worst texture',
'worst perimeter', 'worst area', 'worst smoothness',
'worst compactness', 'worst concavity', 'worst concave points',
'worst symmetry', 'worst fractal dimension'], dtype='<U23')
Please see this blog for more details of OHE when the validation/test sets have previously unseen unique values.
ABSTRACT
from sklearn.model_selection import train_test_split
# Split the provided training data into training, validation and test sets
# The kaggle evaluation test set has no labels
def load_train_valid_test_data(list_of_features=None):
    global X_train, X_valid, X_test, y_train, y_valid, y_test
    if list_of_features is None:
        list_of_features = [
            'SK_ID_CURR', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
            'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
            'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
            'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE',
            'NAME_TYPE_SUITE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
            'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
            'ORGANIZATION_TYPE'
        ]
    print("-+-+-"*10)
    print("Using Application data with selected features ...vvv")
    print(list_of_features)
    print("-+-+-"*10)
    X_train = datasets["application_train"][list_of_features]
    y_train = datasets["application_train"]['TARGET']
    # first hold out 15% for validation, then 15% of the remainder for testing
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
    X_kaggle_test = datasets["application_test"][list_of_features]
    # y_test = datasets["application_test"]['TARGET']   #why no TARGET?!! (hint: kaggle competition)
    print("-------------------------------------------------")
    print(f"X train shape: {X_train.shape}")
    print(f"X validation shape: {X_valid.shape}")
    print(f"X test shape: {X_test.shape}")
    print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
    print(f"Y train shape: {y_train.shape}")
    print(f"Y validation shape: {y_valid.shape}")
    print(f"Y test shape: {y_test.shape}")
    return X_train, X_valid, X_test, y_train, y_valid, y_test
X_train, X_valid, X_test, y_train, y_valid, y_test = load_train_valid_test_data(list_of_features=None)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Using Application data with selected features ...vvv
['SK_ID_CURR', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE', 'NAME_TYPE_SUITE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'ORGANIZATION_TYPE']
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
-------------------------------------------------
X train shape: (222176, 21)
X validation shape: (46127, 21)
X test shape: (39208, 21)
X X_kaggle_test shape: (48744, 21)
Y train shape: (222176,)
Y validation shape: (46127,)
Y test shape: (39208,)
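Because the second split takes 15% of what remains after the first, the three sets are not 70/15/15. The arithmetic can be reproduced in a few lines. This sketch assumes the held-out part is rounded up, which is consistent with the shapes printed above; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 307511                                  # rows in application_train
n_rest, n_valid = split_sizes(n_total, 0.15)      # first split: validation
n_train, n_test = split_sizes(n_rest, 0.15)       # second split: test

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n / n_total:.2%}")
```

The counts come out as 222176, 46127 and 39208, that is roughly 72.25% train, 15% validation and 12.75% test of the labelled data.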
- Pipelines for all secondary tables
- bureau
- previous_applications
- Feature Union to combine multiple Pipelines

class ApplicationFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X = X.copy()  # work on a copy so the caller's DataFrame is not modified
        # DAYS_* columns are negative day offsets; convert them to positive years
        X['DAYS_EMPLOYED'] = (X['DAYS_EMPLOYED']/-365).astype(float)
        X['DAYS_BIRTH'] = (X['DAYS_BIRTH']/-365).astype(float)
        return X
X_train
from sklearn.pipeline import make_pipeline
test_pipeline = make_pipeline(ApplicationFeatureTransformer())
print(test_pipeline.fit_transform(X_train))
#####################-- Python Debugging---################################
# from IPython.core.debugger import Pdb as pdb
# pdb().set_trace()
# breakpoint dont forget to quit
############################################################
SK_ID_CURR AMT_INCOME_TOTAL AMT_CREDIT DAYS_EMPLOYED DAYS_BIRTH \
21614 125178 180000.0 1305000.0 2.402740 34.841096
209797 343134 81000.0 450000.0 3.556164 33.717808
17976 120964 90000.0 127350.0 -1000.665753 61.386301
282543 427277 135000.0 460858.5 0.632877 23.331507
52206 160455 225000.0 611905.5 -1000.665753 41.805479
... ... ... ... ... ...
144129 267117 270000.0 1762110.0 19.775342 64.531507
32963 138205 112500.0 284400.0 1.046575 27.282192
90412 204966 45000.0 180000.0 12.134247 32.898630
246459 385249 202500.0 1736937.0 1.569863 27.969863
212146 345838 58500.0 157500.0 5.682192 23.975342
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 CODE_GENDER FLAG_OWN_REALTY \
21614 0.506595 0.039170 0.415347 F Y
209797 NaN 0.198386 NaN F N
17976 NaN 0.589705 0.735221 F Y
282543 NaN 0.000954 0.065550 M Y
52206 NaN 0.263144 0.160489 M N
... ... ... ... ... ...
144129 0.748672 0.679988 0.553165 F Y
32963 0.297779 0.394895 NaN M N
90412 NaN 0.671937 0.273565 F N
246459 NaN 0.086790 0.520898 F Y
212146 NaN 0.363715 0.368969 F Y
... NAME_CONTRACT_TYPE NAME_EDUCATION_TYPE OCCUPATION_TYPE \
21614 ... Cash loans Higher education Sales staff
209797 ... Cash loans Secondary / secondary special Laborers
17976 ... Cash loans Higher education NaN
282543 ... Cash loans Secondary / secondary special Security staff
52206 ... Cash loans Higher education NaN
... ... ... ... ...
144129 ... Cash loans Secondary / secondary special Secretaries
32963 ... Cash loans Higher education Drivers
90412 ... Revolving loans Secondary / secondary special Core staff
246459 ... Cash loans Secondary / secondary special Sales staff
212146 ... Revolving loans Secondary / secondary special Sales staff
NAME_INCOME_TYPE NAME_TYPE_SUITE NAME_FAMILY_STATUS \
21614 Commercial associate Family Married
209797 Working Unaccompanied Single / not married
17976 Pensioner Unaccompanied Widow
282543 Working Unaccompanied Single / not married
52206 Pensioner Unaccompanied Civil marriage
... ... ... ...
144129 Commercial associate Unaccompanied Married
32963 Commercial associate Unaccompanied Single / not married
90412 State servant Unaccompanied Married
246459 Commercial associate Unaccompanied Civil marriage
212146 Commercial associate Unaccompanied Single / not married
NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START \
21614 House / apartment TUESDAY
209797 House / apartment MONDAY
17976 Municipal apartment THURSDAY
282543 House / apartment TUESDAY
52206 Municipal apartment SATURDAY
... ... ...
144129 House / apartment WEDNESDAY
32963 House / apartment THURSDAY
90412 House / apartment WEDNESDAY
246459 Municipal apartment TUESDAY
212146 With parents TUESDAY
HOUR_APPR_PROCESS_START ORGANIZATION_TYPE
21614 12 Trade: type 1
209797 15 Business Entity Type 1
17976 9 XNA
282543 13 Business Entity Type 3
52206 17 XNA
... ... ...
144129 9 Trade: type 2
32963 18 Business Entity Type 3
90412 12 Government
246459 14 Trade: type 7
212146 16 Business Entity Type 3
[222176 rows x 21 columns]
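The value -1000.665753 in the DAYS_EMPLOYED column above equals 365243 / -365, so those rows held 365243 before the conversion. Assuming 365243 is a placeholder rather than a real duration (an assumption; the notebook does not say so), one option is to mask it before converting days to years. The sketch below is a hypothetical variant of the transformer: the class name `DaysToYears` and the `sentinel` parameter are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DaysToYears(BaseEstimator, TransformerMixin):
    """Convert negative day offsets to positive years, masking a sentinel value."""

    def __init__(self, columns=("DAYS_EMPLOYED", "DAYS_BIRTH"), sentinel=365243):
        self.columns = columns
        self.sentinel = sentinel  # assumed placeholder for "not applicable"

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()  # leave the input frame untouched
        for col in self.columns:
            days = X[col].astype(float).replace(self.sentinel, np.nan)
            X[col] = days / -365
        return X

# Hypothetical toy rows: one regular applicant and one carrying the sentinel
toy_app = pd.DataFrame({"DAYS_EMPLOYED": [-730, 365243], "DAYS_BIRTH": [-3650, -7300]})
converted = DaysToYears().fit_transform(toy_app)
print(converted)
```

The masked rows become NaN and can then be handled by whatever imputation strategy the numerical pipeline uses.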
s_data = bur[bur["SK_ID_CURR"]==100001 ]
display(s_data)
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 248484 | 100001 | 5896630 | Closed | currency 1 | -857 | 0 | -492.0 | -553.0 | NaN | 0 | 112500.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -155 | 0.0 |
| 248485 | 100001 | 5896631 | Closed | currency 1 | -909 | 0 | -179.0 | -877.0 | NaN | 0 | 279720.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -155 | 0.0 |
| 248486 | 100001 | 5896632 | Closed | currency 1 | -879 | 0 | -514.0 | -544.0 | NaN | 0 | 91620.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -155 | 0.0 |
| 248487 | 100001 | 5896633 | Closed | currency 1 | -1572 | 0 | -1329.0 | -1328.0 | NaN | 0 | 85500.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -155 | 0.0 |
| 248488 | 100001 | 5896634 | Active | currency 1 | -559 | 0 | 902.0 | NaN | NaN | 0 | 337680.0 | 113166.0 | 0.0 | 0.0 | Consumer credit | -6 | 4630.5 |
| 248489 | 100001 | 5896635 | Active | currency 1 | -49 | 0 | 1778.0 | NaN | NaN | 0 | 378000.0 | 373239.0 | 0.0 | 0.0 | Consumer credit | -16 | 10822.5 |
| 248490 | 100001 | 5896636 | Active | currency 1 | -320 | 0 | 411.0 | NaN | NaN | 0 | 168345.0 | 110281.5 | NaN | 0.0 | Consumer credit | -10 | 9364.5 |
# Total number of Past loans ?
s_data[["SK_ID_CURR", "CREDIT_ACTIVE"]].groupby("SK_ID_CURR").count().reset_index().rename(columns = {'CREDIT_ACTIVE':'TOTAL_PAST_LOANS'})
# left join with main table
| SK_ID_CURR | TOTAL_PAST_LOANS | |
|---|---|---|
| 0 | 100001 | 7 |
# Total types of loan
s_data[["SK_ID_CURR", "CREDIT_TYPE"]].groupby("SK_ID_CURR").nunique().reset_index().rename(columns = {'CREDIT_TYPE':'TOTAL_TYPES_OF_LOAN'})
# left join with main table
| SK_ID_CURR | TOTAL_TYPES_OF_LOAN | |
|---|---|---|
| 0 | 100001 | 1 |
# % of active loans
# vectorized flag; assign() returns a new frame, which avoids modifying a slice of `bur`
s_data = s_data.assign(is_credit_active=s_data["CREDIT_ACTIVE"] != "Closed")
s_data[["SK_ID_CURR", "is_credit_active"]].groupby("SK_ID_CURR").mean().reset_index().rename(columns = {'is_credit_active':'ACTIVE_LOANS_MEAN'})
# left join with main table
| SK_ID_CURR | ACTIVE_LOANS_MEAN | |
|---|---|---|
| 0 | 100001 | 0.428571 |
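The three per-applicant bureau summaries above (loan count, number of loan types, share of active loans) were each computed with a separate groupby on a single applicant. As a sketch of how they might be produced for all applicants in one pass, the example below uses pandas named aggregation on a made-up frame. The frame `toy_bureau` and its values are illustrative; the output column names follow the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the bureau table
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "CREDIT_ACTIVE": ["Closed", "Active", "Active", "Closed"],
    "CREDIT_TYPE": ["Consumer credit", "Consumer credit", "Credit card", "Consumer credit"],
})
toy_bureau["is_credit_active"] = toy_bureau["CREDIT_ACTIVE"] != "Closed"

bureau_summary = (
    toy_bureau.groupby("SK_ID_CURR")
    .agg(
        TOTAL_PAST_LOANS=("CREDIT_ACTIVE", "count"),
        TOTAL_TYPES_OF_LOAN=("CREDIT_TYPE", "nunique"),
        ACTIVE_LOANS_MEAN=("is_credit_active", "mean"),  # mean of booleans = share of True
    )
    .reset_index()
)
print(bureau_summary)
```

The result has one row per `SK_ID_CURR`, so it can be left-joined to the main table in a single merge.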
# average of (days to credit end) for active credit.
with_active_credits = s_data[s_data["is_credit_active"]]
# display(with_active_credits)
if len(with_active_credits):
    print(with_active_credits[["SK_ID_CURR", "DAYS_CREDIT_ENDDATE"]].groupby("SK_ID_CURR").mean().reset_index().rename(columns={'DAYS_CREDIT_ENDDATE': 'DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN'}))
# left join with main table
# ! WARNING: When joining the above dataframe with the main table, do not forget to fill
# empty values in column `DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN` with 0. Applicants with
# no active credits will have np.nan in this column after the join; that gap means
# "no active credit" rather than a missing measurement, so it should be filled with 0
# and not imputed.
   SK_ID_CURR  DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN
0      100001                          1030.333333
# mean number of prolonged credits
# s_data[~s_data["CNT_CREDIT_PROLONG"].isna()]
# mean overdue loans with % of active loans ??????
# combine both conditions in one mask instead of chaining two boolean filters
t_data = bur[bur["AMT_CREDIT_MAX_OVERDUE"].notna() & (bur.SK_ID_CURR == 215354)]
t_data.groupby("SK_ID_CURR")["AMT_CREDIT_MAX_OVERDUE"].mean().reset_index().rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN'})
# left join with main table
# ! WARNING: When joining the above dataframe with the main table, do not forget to fill
# empty values in column `AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN` with 0. Applicants with
# no recorded overdue amount will have np.nan in this column after the join,
# so fill it with 0 rather than imputing it.
| SK_ID_CURR | AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN | |
|---|---|---|
| 0 | 215354 | 25891.5 |
# % of utilized debt??
def fun(row):
    # debt divided by the credit sum net of the overdue amount
    return row.AMT_CREDIT_SUM_DEBT / (row.AMT_CREDIT_SUM - row.AMT_CREDIT_SUM_OVERDUE)
# combine both conditions in one mask instead of chaining two boolean filters
x_data = bur[(bur.CREDIT_ACTIVE == "Active") & bur["AMT_CREDIT_MAX_OVERDUE"].notna()]
r_cols = ["AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE"]
t_data = x_data[x_data.SK_ID_CURR==162297].groupby("SK_ID_CURR")[r_cols].sum().reset_index()
t_data["UTILIZED_DEBT"] = t_data.apply(fun, axis=1)
t_data.drop(r_cols, axis=1)
# left join with main table
# ! WARNING: When joining the above dataframe with the main table, do not forget to fill
# empty values in column `UTILIZED_DEBT` with 0. Applicants with no active credits
# will have np.nan in this column after the join, so fill it with 0 rather than imputing it.
| SK_ID_CURR | UTILIZED_DEBT | |
|---|---|---|
| 0 | 162297 | 0.0 |
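The utilization ratio above divides by `AMT_CREDIT_SUM - AMT_CREDIT_SUM_OVERDUE`, which can be zero for some applicants. A vectorized variant that guards against this might look as follows; the frame `toy_credit` and its values are made up, and replacing a zero denominator with NaN is one possible choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical active credits: applicant 2 has a zero credit sum
toy_credit = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 0.0],
    "AMT_CREDIT_SUM_DEBT": [50.0, 150.0, 0.0],
    "AMT_CREDIT_SUM_OVERDUE": [0.0, 0.0, 0.0],
})
r_cols = ["AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE"]

sums = toy_credit.groupby("SK_ID_CURR")[r_cols].sum()
# A zero denominator becomes NaN so the ratio is NaN instead of inf or a 0/0 artifact
denominator = (sums["AMT_CREDIT_SUM"] - sums["AMT_CREDIT_SUM_OVERDUE"]).replace(0, np.nan)
utilized = (sums["AMT_CREDIT_SUM_DEBT"] / denominator).rename("UTILIZED_DEBT").reset_index()
print(utilized)
```

Whether such NaN values should later be filled with 0 follows the same reasoning as the warning comments in the cells above.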
all_counts = bb.groupby("SK_ID_BUREAU")["STATUS"].count()
b_data = bb[(bb.STATUS =="C") | (bb.STATUS =="0")].groupby("SK_ID_BUREAU")["STATUS"]
# count of closed or completed records
b_data.count().reset_index().rename(columns={"STATUS": "STATUS_COMPLETED_COUNT"}).head(3)
| SK_ID_BUREAU | STATUS_COMPLETED_COUNT | |
|---|---|---|
| 0 | 5001709 | 86 |
| 1 | 5001710 | 53 |
| 2 | 5001711 | 3 |
# mean of closed or completed records
# the higher the value, the better
t_data = bb[(bb.STATUS =="C") | (bb.STATUS =="0")].groupby("SK_ID_BUREAU")["STATUS"].count() / all_counts
t_data = t_data.reset_index().rename(columns={"STATUS": "STATUS_COMPLETED_MEAN"}).fillna(0)
### Merge this with bureau and then to the application table
a = bur[["SK_ID_CURR", "SK_ID_BUREAU"]].merge(t_data, how="left", on="SK_ID_BUREAU").drop("SK_ID_BUREAU", axis=1)
a[~a.STATUS_COMPLETED_MEAN.isna()].head(5)
# count of records that are past their due date
y_data = bb[(bb.STATUS == "3") | (bb.STATUS == "4") | (bb.STATUS == "5")].groupby("SK_ID_BUREAU")["STATUS"].count()
y_data.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_COUNT"}).head(3)
| | SK_ID_BUREAU | STATUS_PAST_DUE_COUNT |
|---|---|---|
| 0 | 5001797 | 4 |
| 1 | 5001799 | 4 |
| 2 | 5001928 | 2 |
# share of records that are past the due date (STATUS 3, 4 or 5)
x_data = bb[(bb.STATUS == "3") | (bb.STATUS == "4") | (bb.STATUS == "5")].groupby("SK_ID_BUREAU")["STATUS"].count() / all_counts
x_data.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_MEAN"}).fillna(0).head(10)
| | SK_ID_BUREAU | STATUS_PAST_DUE_MEAN |
|---|---|---|
| 0 | 5001709 | 0.0 |
| 1 | 5001710 | 0.0 |
| 2 | 5001711 | 0.0 |
| 3 | 5001712 | 0.0 |
| 4 | 5001713 | 0.0 |
| 5 | 5001714 | 0.0 |
| 6 | 5001715 | 0.0 |
| 7 | 5001716 | 0.0 |
| 8 | 5001717 | 0.0 |
| 9 | 5001718 | 0.0 |
# % of records where status is unknown
x_data = bb[bb.STATUS == "X"].groupby("SK_ID_BUREAU")["STATUS"].count() / all_counts
x_data.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_UNKNOWN_MEAN"}).fillna(0).head(10)
| | SK_ID_BUREAU | STATUS_PAST_DUE_UNKNOWN_MEAN |
|---|---|---|
| 0 | 5001709 | 0.113402 |
| 1 | 5001710 | 0.361446 |
| 2 | 5001711 | 0.250000 |
| 3 | 5001712 | 0.000000 |
| 4 | 5001713 | 1.000000 |
| 5 | 5001714 | 1.000000 |
| 6 | 5001715 | 1.000000 |
| 7 | 5001716 | 0.232558 |
| 8 | 5001717 | 0.000000 |
| 9 | 5001718 | 0.256410 |
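The three share features above (completed, past due, unknown) each filter `bb`, group it, and divide by `all_counts`. A possible alternative is to compute all shares in one pass with `pd.crosstab(..., normalize="index")`. The sketch below uses a small invented table (`bb_toy`); the group labels and the resulting column names are assumptions for illustration, and statuses not used by the cells above are lumped into a hypothetical `OTHER` group.

```python
import pandas as pd

# Toy stand-in for bureau_balance (values are invented for illustration).
bb_toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 10, 11, 11, 12],
    "STATUS": ["C", "0", "X", "3", "X", "X", "5"],
})

# Map raw statuses to the coarse groups used by the cells above.
status_group = bb_toy["STATUS"].map({
    "C": "COMPLETED", "0": "COMPLETED",
    "3": "PAST_DUE", "4": "PAST_DUE", "5": "PAST_DUE",
    "X": "UNKNOWN",
}).fillna("OTHER")  # any status not listed above

# One row per bureau credit, one column per group, each row sums to 1.
shares = pd.crosstab(bb_toy["SK_ID_BUREAU"], status_group, normalize="index")
shares = shares.add_prefix("STATUS_").add_suffix("_MEAN").reset_index()
print(shares)
```

Because every bureau credit gets a value for every group, no `fillna(0)` is needed at this stage; missing values only appear later, when the result is left-joined to credits that have no balance history.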
### Do these features need to be imputed?
numerical_features = ["EXT_SOURCE_3","EXT_SOURCE_2", "EXT_SOURCE_1"]
# combine these with the above, but they require separate processing
"DAYS_EMPLOYED", 'DAYS_BIRTH'
cat_application_features = ["NAME_CONTRACT_TYPE", "NAME_TYPE_SUITE", "NAME_INCOME_TYPE", "NAME_EDUCATION_TYPE", "NAME_FAMILY_STATUS", "NAME_HOUSING_TYPE", "OCCUPATION_TYPE", "WEEKDAY_APPR_PROCESS_START", "HOUR_APPR_PROCESS_START", "ORGANIZATION_TYPE", "CODE_GENDER", "FLAG_OWN_CAR", "FLAG_OWN_REALTY"]
ccb.columns
ccb[ccb.AMT_RECIVABLE!=0].sort_values(by="SK_ID_CURR").tail(10)
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1684283 | 1794451 | 456250 | -10 | 186577.605 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9892.485 | ... | 185907.105 | 185907.105 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | Active | 0 | 0 |
| 3734047 | 1794451 | 456250 | -9 | 180536.760 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9653.985 | ... | 179866.260 | 179866.260 | 0.0 | 0 | 0.0 | 0.0 | 2.0 | Active | 0 | 0 |
| 2985617 | 1794451 | 456250 | -8 | 177219.000 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9465.705 | ... | 176958.900 | 176958.900 | 0.0 | 0 | 0.0 | 0.0 | 3.0 | Active | 0 | 0 |
| 3000394 | 1794451 | 456250 | -12 | 181993.500 | 180000 | 171000.0 | 171000.0 | 0.0 | 0.0 | 0.000 | ... | 171000.000 | 171000.000 | 7.0 | 7 | 0.0 | 0.0 | 0.0 | Active | 0 | 0 |
| 310299 | 1794451 | 456250 | -4 | 166188.150 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 8804.565 | ... | 166188.150 | 166188.150 | 0.0 | 0 | 0.0 | 0.0 | 7.0 | Active | 0 | 0 |
| 1049726 | 1794451 | 456250 | -2 | 158266.935 | 175500 | 0.0 | 0.0 | 0.0 | 0.0 | 8477.730 | ... | 158266.935 | 158266.935 | 0.0 | 0 | 0.0 | 0.0 | 9.0 | Active | 0 | 0 |
| 431924 | 1794451 | 456250 | -6 | 171943.020 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9084.375 | ... | 171943.020 | 171943.020 | 0.0 | 0 | 0.0 | 0.0 | 5.0 | Active | 0 | 0 |
| 3611324 | 1794451 | 456250 | -11 | 200208.915 | 180000 | 9000.0 | 9000.0 | 0.0 | 0.0 | 0.000 | ... | 196581.915 | 196581.915 | 1.0 | 1 | 0.0 | 0.0 | 0.0 | Active | 0 | 0 |
| 1154348 | 1794451 | 456250 | -3 | 162425.565 | 175500 | 0.0 | 0.0 | 0.0 | 0.0 | 8643.600 | ... | 162425.565 | 162425.565 | 0.0 | 0 | 0.0 | 0.0 | 8.0 | Active | 0 | 0 |
| 2248506 | 1794451 | 456250 | -7 | 174435.885 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9240.705 | ... | 174435.885 | 174435.885 | 0.0 | 0 | 0.0 | 0.0 | 4.0 | Active | 0 | 0 |
10 rows × 23 columns
def func(*args):
    print(args[0])
list(filter(lambda x: x!=1, ccb.groupby(["SK_ID_CURR", "SK_ID_PREV"]).SK_ID_PREV.nunique().values))
[]
c_data = ccb[ccb.SK_ID_CURR==100011].sort_values(by="MONTHS_BALANCE", ascending=False)
c_data
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2739019 | 1843384 | 100011 | -2 | 0.000 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 33.0 | Active | 0 | 0 |
| 3496910 | 1843384 | 100011 | -3 | 0.000 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 33.0 | Active | 0 | 0 |
| 51047 | 1843384 | 100011 | -4 | 0.000 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 33.0 | Active | 0 | 0 |
| 2674883 | 1843384 | 100011 | -5 | 0.000 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 33.0 | Active | 0 | 0 |
| 131693 | 1843384 | 100011 | -6 | 0.000 | 90000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 33.0 | Active | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1872143 | 1843384 | 100011 | -71 | 173901.915 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9000.0 | ... | 173901.915 | 173901.915 | 0.0 | 0 | 0.0 | 0.0 | 4.0 | Active | 0 | 0 |
| 1086495 | 1843384 | 100011 | -72 | 177544.350 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9000.0 | ... | 177544.350 | 177544.350 | 0.0 | 0 | 0.0 | 0.0 | 3.0 | Active | 0 | 0 |
| 2353190 | 1843384 | 100011 | -73 | 181044.540 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9000.0 | ... | 181044.540 | 181044.540 | 0.0 | 0 | 0.0 | 0.0 | 2.0 | Active | 0 | 0 |
| 2447092 | 1843384 | 100011 | -74 | 184568.850 | 180000 | 0.0 | 0.0 | 0.0 | 0.0 | 9000.0 | ... | 184568.850 | 184568.850 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | Active | 0 | 0 |
| 3131464 | 1843384 | 100011 | -75 | 189000.000 | 180000 | 180000.0 | 180000.0 | 0.0 | 0.0 | NaN | ... | 189000.000 | 189000.000 | 4.0 | 4 | 0.0 | 0.0 | NaN | Active | 0 | 0 |
74 rows × 23 columns
# Total number of credit card loans per customer, Dataset: CCB
grp = c_data.groupby(by = ['SK_ID_CURR'])['SK_ID_PREV'].nunique().reset_index().rename(columns = {'SK_ID_PREV': 'TOTAL_LOANS'})
# CCB = CCB.merge(grp, on = ['SK_ID_CURR'], how = 'left')
display(grp)
del grp
gc.collect()
| | SK_ID_CURR | TOTAL_LOANS |
|---|---|---|
| 0 | 100011 | 1 |
88
# Maximum number of installments per loan
max_no_credit_install = c_data.groupby(by =['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(columns = {'CNT_INSTALMENT_MATURE_CUM': 'MAX_NO_CREDIT_INSTALMENTS'})
# Total number of installments per customer (sum over that customer's loans)
total_credit_installments = max_no_credit_install.groupby("SK_ID_CURR")["MAX_NO_CREDIT_INSTALMENTS"].sum().reset_index().rename(columns = {'MAX_NO_CREDIT_INSTALMENTS': 'TOTAL_CREDIT_INSTALMENTS'})
del max_no_credit_install, total_credit_installments
gc.collect()
39
# total & mean number of monthly records past the due date
def past_due_date_count(*arg):
    return len(list(filter(lambda x: x!=0, arg[0].SK_DPD.values)))
total_past_due_date = ccb.groupby(["SK_ID_CURR","SK_ID_PREV"]).apply(past_due_date_count).reset_index().rename({0:"TOTAL_PAST_DUE_DATE"},axis=1)
# mean over each customer's loans
mean_past_due_date = total_past_due_date.groupby("SK_ID_CURR").TOTAL_PAST_DUE_DATE.mean().reset_index().rename({"TOTAL_PAST_DUE_DATE":"MEAN_PAST_DUE_DATE"},axis=1)
del total_past_due_date, mean_past_due_date
gc.collect()
0
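Counting past-due records with `groupby(...).apply(...)` and a Python `filter` calls a Python function once per group, which can be slow on a table the size of `ccb`. A vectorized variant is sketched below on a toy frame (`ccb_toy` and its values are invented for illustration): the comparison `SK_DPD != 0` yields a boolean column, and summing booleans per group gives the count directly.

```python
import pandas as pd

# Toy stand-in for the credit card balance table (invented values).
ccb_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 2, 2],
    "SK_ID_PREV": [100, 100, 101, 101, 200, 200],
    "SK_DPD": [0, 3, 0, 0, 7, 2],
})

# True for every monthly record with days past due, then a sum per loan.
past_due_flag = ccb_toy["SK_DPD"].ne(0)
total_past_due = (
    past_due_flag.groupby([ccb_toy["SK_ID_CURR"], ccb_toy["SK_ID_PREV"]])
    .sum()
    .rename("TOTAL_PAST_DUE_DATE")
    .reset_index()
)

# Average of the per-loan counts over each customer's loans.
mean_past_due = (
    total_past_due.groupby("SK_ID_CURR")["TOTAL_PAST_DUE_DATE"]
    .mean()
    .rename("MEAN_PAST_DUE_DATE")
    .reset_index()
)
print(total_past_due)
print(mean_past_due)
```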
grouped_data = ccb.groupby(["SK_ID_CURR", "SK_ID_PREV"])
r_cols = ["AMT_DRAWINGS_ATM_CURRENT","AMT_DRAWINGS_CURRENT","AMT_DRAWINGS_OTHER_CURRENT","AMT_DRAWINGS_POS_CURRENT"]
grouped_data[r_cols].mean().fillna(0).reset_index().rename({ _ : _ +"_MEAN" for _ in r_cols}, axis=1)
| | SK_ID_CURR | SK_ID_PREV | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN |
|---|---|---|---|---|---|---|
| 0 | 100006 | 1489396 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 1 | 100011 | 1843384 | 2432.432432 | 2432.432432 | 0.0 | 0.000000 |
| 2 | 100013 | 2038692 | 6350.000000 | 5953.125000 | 0.0 | 0.000000 |
| 3 | 100021 | 2594025 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 4 | 100023 | 1499902 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 104302 | 456244 | 2181926 | 24475.609756 | 26842.388049 | 0.0 | 2363.015854 |
| 104303 | 456246 | 1079732 | 0.000000 | 15199.256250 | 0.0 | 15199.256250 |
| 104304 | 456247 | 1595171 | 2136.315789 | 2149.506474 | 0.0 | 13.190684 |
| 104305 | 456248 | 2743495 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 104306 | 456250 | 1794451 | 15000.000000 | 15000.000000 | 0.0 | 0.000000 |
104307 rows × 6 columns
class BureauFeaturesAgg(BaseEstimator, TransformerMixin):
    def __init__(self, bur_dataset): # no *args or **kargs
        print("Called Feature Aggregator for Datasets : `Bureau`")
        self.bur = bur_dataset
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Total number of past loans, Dataset: Bureau
        past_loans = self.bur[["SK_ID_CURR", "CREDIT_ACTIVE"]].groupby("SK_ID_CURR").count().reset_index().rename(columns = {'CREDIT_ACTIVE':'TOTAL_PAST_LOANS'})
        # Total types of loan, Dataset: Bureau
        types_of_loan = self.bur[["SK_ID_CURR", "CREDIT_TYPE"]].groupby("SK_ID_CURR").nunique().reset_index().rename(columns = {'CREDIT_TYPE':'TOTAL_TYPES_OF_LOAN'})
        # % of active loans, Dataset: Bureau (every status other than "Closed" counts as active)
        self.bur["is_credit_active"] = self.bur["CREDIT_ACTIVE"] != "Closed"
        active_loans_mean = self.bur[["SK_ID_CURR", "is_credit_active"]].groupby("SK_ID_CURR").mean().reset_index().rename(columns = {'is_credit_active':'ACTIVE_LOANS_MEAN'})
        # average of (days to credit end) for active credit, in years, Dataset: Bureau
        with_active_credits = self.bur[self.bur["is_credit_active"]]
        # if len(with_active_credits):
        days_to_credit_end_mean = with_active_credits[["SK_ID_CURR", "DAYS_CREDIT_ENDDATE"]].groupby("SK_ID_CURR").mean().reset_index().rename(columns={'DAYS_CREDIT_ENDDATE': 'DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN'})
        days_to_credit_end_mean["DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN"] = days_to_credit_end_mean["DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN"].apply(lambda x: 0 if x/365 < 0 else x/365)
        # mean amount of prolonged credits
        _max_overdue = self.bur[~self.bur["AMT_CREDIT_MAX_OVERDUE"].isna()]
        max_overdue = _max_overdue.groupby("SK_ID_CURR")["AMT_CREDIT_MAX_OVERDUE"].mean().reset_index().rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN'})
        # share of utilized debt: total debt / (total credit - total overdue)
        def mean_debt(*arg):
            return arg[0].AMT_CREDIT_SUM_DEBT/ (arg[0].AMT_CREDIT_SUM - arg[0].AMT_CREDIT_SUM_OVERDUE)
        # build the mask from the filtered frame so the boolean index lines up with it
        _max_overdue_active = with_active_credits[~with_active_credits["AMT_CREDIT_MAX_OVERDUE"].isna()]
        r_cols = ["AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE"]
        max_overdue_active = _max_overdue_active.groupby("SK_ID_CURR")[r_cols].sum().reset_index()
        max_overdue_active["UTILIZED_DEBT"] = max_overdue_active.apply(mean_debt, axis=1)
        max_overdue_active = max_overdue_active.drop(r_cols, axis=1)
        _result_1 = X.merge(past_loans, on="SK_ID_CURR", how="left")
        _result_2 = _result_1.merge(types_of_loan, on="SK_ID_CURR", how="left")
        _result_3 = _result_2.merge(active_loans_mean, on="SK_ID_CURR", how="left")
        _result_4 = _result_3.merge(days_to_credit_end_mean, on="SK_ID_CURR", how="left")
        _result_5 = _result_4.merge(max_overdue, on="SK_ID_CURR", how="left")
        result = _result_5.merge(max_overdue_active, on="SK_ID_CURR", how="left")
        new_cols = ["TOTAL_PAST_LOANS", "TOTAL_TYPES_OF_LOAN", "ACTIVE_LOANS_MEAN", "DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN", "AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN", "UTILIZED_DEBT"]
        # assign back: fillna(inplace=True) on a column selection only changes a temporary copy
        result[new_cols] = result[new_cols].fillna(0)
        return result
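After a left join, applicants with no matching bureau rows get NaN in the engineered columns. Selecting columns with a list (`result[new_cols]`) returns a new DataFrame, so an in-place `fillna` on that selection does not reach `result`; assigning the filled selection back does. A minimal sketch on invented toy frames (`apps_toy` and `agg_toy` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy application table and toy aggregated bureau features (invented values).
apps_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, np.nan, 300.0]})
agg_toy = pd.DataFrame({"SK_ID_CURR": [1, 3], "TOTAL_PAST_LOANS": [4, 2]})

merged = apps_toy.merge(agg_toy, on="SK_ID_CURR", how="left")
new_cols = ["TOTAL_PAST_LOANS"]

# Fill only the engineered columns and assign the result back to the frame.
merged[new_cols] = merged[new_cols].fillna(0)
print(merged)
```

Only the engineered column is filled; the NaN in `AMT_CREDIT` is left untouched so that a later imputation step can still handle it.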
def test_driver_prevAppsFeaturesAggregater(df, features):
    print("Executing the test driver............")
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n")
    display(df[features].head(5))
    print("---- Testing with `make_pipeline`---------")
    from sklearn.pipeline import make_pipeline
    test_pipeline = make_pipeline(BureauFeaturesAgg(bur))
    return(test_pipeline.fit_transform(df))
# All features of previous applications .....
features = ['AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CNT_PAYMENT',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# Features of interest.....
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print("\n\n----- Results ----------")
print(f"Test driver: \n")
display(res.head(10))
print(f"input[features][0:10]: \n")
display(appsDF.head(10))
test_columns = ["SK_ID_CURR", "TOTAL_PAST_LOANS", "TOTAL_TYPES_OF_LOAN", "ACTIVE_LOANS_MEAN", "DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN", "AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN", "UTILIZED_DEBT"]
display(res[test_columns].head(25))
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
def fetch_bur(x):
    return bur[bur.SK_ID_CURR==x]
fetch_bur(271877)
# 176158
# combine both conditions in one mask instead of chaining two boolean selections
s = bur[(bur.SK_ID_CURR==202054) & (bur.CREDIT_ACTIVE == "Active")]
t = (s[["SK_ID_CURR", "DAYS_CREDIT_ENDDATE"]].groupby("SK_ID_CURR").mean()/365).reset_index().rename(columns={'DAYS_CREDIT_ENDDATE': 'DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN'})
t["DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN"] = t["DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN"].apply(lambda x: 0 if x < 0 else x)
t
[_ for _ in bur.columns if _.startswith("AMT")]
class FeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, ds, features, groupby_col, agg_previous_features=False): # no *args or **kargs
        self.dataset = ds
        self.features = features
        self.groupby = groupby_col
        self.agg_op_features = {}
        self.agg_previous_features = agg_previous_features
        for f in features:
            self.agg_op_features[f]=[]
            self.agg_op_features[f].extend((f"{f}_{func}",func) for func in ["min", "max", "mean"])
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # X is application table
        _agg_dataset = self.dataset.groupby([self.groupby]).agg(self.agg_op_features)
        _agg_dataset.columns = _agg_dataset.columns.droplevel()
        _agg_dataset = _agg_dataset.reset_index(level=[self.groupby])
        _agg_dataset.fillna(0, inplace=True)
        if self.agg_previous_features:
            result = X.merge(_agg_dataset, on="SK_ID_CURR", how="left")
            return result
        result = _agg_dataset
        return result
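The aggregation spec built in `__init__` maps each feature to a list of `(output_name, function)` tuples. Passed to `groupby(...).agg(...)`, this yields two-level column labels of the form `(feature, output_name)`, which is why `transform` drops the top level afterwards. The sketch below shows the mechanics on an invented toy frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy table with one numeric feature (invented values).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_ANNUITY": [10.0, 30.0, 5.0],
})
features = ["AMT_ANNUITY"]

# Same shape as agg_op_features: {feature: [(output_name, func), ...]}
agg_spec = {f: [(f"{f}_{func}", func) for func in ["min", "max", "mean"]] for f in features}

agg = toy.groupby("SK_ID_CURR").agg(agg_spec)
print(agg.columns.tolist())  # two-level labels: (feature, output_name)

# Keep only the output names, then turn the group key back into a column.
agg.columns = agg.columns.droplevel(0)
agg = agg.reset_index()
print(agg)
```

Because `mean` is part of the spec, every feature in the list has to be numeric; a categorical column would need a different set of functions.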
# ####################-- Python Debugging---################################
# from IPython.core.debugger import Pdb as pdb
# pdb().set_trace()
# breakpoint dont forget to quit
# ###########################################################
# # X is application table
# new_features = [y[0] for x in self.agg_op_features.values() for y in x]
# result = X.merge(_agg_dataset, on="SK_ID_CURR", how="left")
# result[new_features].fillna(0, inplace=True)
# return result
def test_driver_prevAppsFeaturesAggregater(df, features):
    print("Executing the test driver............")
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n")
    # display(df[features].head(5))
    print("---- Testing with `make_pipeline`---------")
    from sklearn.pipeline import make_pipeline
    test_pipeline = make_pipeline(FeaturesAggregater(bur, features,"SK_ID_CURR"))
    return(test_pipeline.fit_transform(df))
# All features of previous applications .....
features = ['AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CNT_PAYMENT',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# Features of interest.....
bureau_features = ['AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE','AMT_ANNUITY']
res = test_driver_prevAppsFeaturesAggregater(appsDF, bureau_features)
print("\n\n----- Results ----------")
print(f"Test driver: \n")
display(res.head(10))
print(f"input[features][0:10]: \n")
display(appsDF.head(10))
# test_columns = ["SK_ID_CURR", "TOTAL_PAST_LOANS", "TOTAL_TYPES_OF_LOAN", "ACTIVE_LOANS_MEAN", "DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN", "AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN", "UTILIZED_DEBT"]
# display(res[test_columns].head(25))
class BureauBalanceFeaturesAgg(BaseEstimator, TransformerMixin):
    def __init__(self, bb_dataset, bur_dataset):
        print("Called Feature Aggregator for Datasets : `Bureau Balance`")
        self.bb = bb_dataset
        self.bur = bur_dataset
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        ########################################################################
        all_counts = self.bb.groupby("SK_ID_BUREAU")["STATUS"].count()
        # Count of completed payments
        _completed_status_count = self.bb[(self.bb.STATUS =="C") | (self.bb.STATUS =="0")].groupby("SK_ID_BUREAU")["STATUS"].count()
        completed_status = _completed_status_count.reset_index().rename(columns={"STATUS": "STATUS_COMPLETED_COUNT"}).fillna(0)
        # share of closed or completed records
        _mean_completed_status = _completed_status_count / all_counts
        mean_completed_status = _mean_completed_status.reset_index().rename(columns={"STATUS": "STATUS_COMPLETED_MEAN"}).fillna(0)
        # Memory overflow past this point, so the remaining features are disabled
        # count of records which are past the due date
        # cond = (self.bb.STATUS == "3") | (self.bb.STATUS == "4") | (self.bb.STATUS == "5")
        # _past_due_status_count = self.bb[cond].groupby("SK_ID_BUREAU")["STATUS"].count()
        # past_due_status_count = _past_due_status_count.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_COUNT"})
        # # share of records past the due date
        # _mean_past_status = past_due_status_count/ all_counts
        # mean_past_status = _mean_past_status.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_MEAN"}).fillna(0)
        # # share of records where status is unknown
        # _status_unknown = self.bb[self.bb.STATUS == "X"].groupby("SK_ID_BUREAU")["STATUS"].count() / all_counts
        # status_unknown = _status_unknown.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_UNKNOWN_MEAN"}).fillna(0)
        _bureau_data = self.bur[["SK_ID_CURR", "SK_ID_BUREAU"]]
        _result_1 = _bureau_data.merge(completed_status, how="left", on="SK_ID_BUREAU")
        _result_2 = _result_1.merge(mean_completed_status, how="left", on="SK_ID_BUREAU")
        # _result_3 = _result_2.merge(past_due_status_count, how="left", on="SK_ID_BUREAU")
        # _result_4 = _result_3.merge(mean_past_status, how="left", on="SK_ID_BUREAU")
        # _result_5 = _result_4.merge(status_unknown, how="left", on="SK_ID_BUREAU")
        # _result_6 = _result_5.drop("SK_ID_BUREAU", axis=1)
        _result_6 = _result_2.drop("SK_ID_BUREAU", axis=1)
        # Merge with original table
        # NOTE: _result_6 has one row per bureau credit, so this left join repeats
        # every applicant who has more than one bureau credit
        result = X.merge(_result_6, on="SK_ID_CURR", how="left")
        # "STATUS_PAST_DUE_COUNT", "STATUS_PAST_DUE_MEAN", "STATUS_PAST_DUE_UNKNOWN_MEAN" not included
        new_cols = ["STATUS_COMPLETED_COUNT", "STATUS_COMPLETED_MEAN"]
        # assign back: fillna(inplace=True) on a column selection only changes a temporary copy
        result[new_cols] = result[new_cols].fillna(0)
        return result
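Features computed per `SK_ID_BUREAU` describe one bureau credit, while the application table has one row per `SK_ID_CURR`. Joining the per-credit rows straight onto the application table produces one output row per credit. One way to keep a single row per applicant is to roll the per-credit features up to `SK_ID_CURR` first. The sketch below does this on invented toy frames; the choice of `sum` for the count and `mean` for the share is an assumption for illustration, not something the notebook prescribes.

```python
import pandas as pd

# Toy frames (invented values): applications, bureau credits, per-credit features.
apps_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau_toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_ID_BUREAU": [10, 11, 12]})
per_bureau = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11],
    "STATUS_COMPLETED_COUNT": [4, 2],
    "STATUS_COMPLETED_MEAN": [1.0, 0.5],
})

# Attach the applicant id to every per-credit feature row.
per_record = bureau_toy.merge(per_bureau, on="SK_ID_BUREAU", how="left")

# Roll up to one row per applicant before touching the application table.
per_client = per_record.groupby("SK_ID_CURR", as_index=False).agg(
    STATUS_COMPLETED_COUNT=("STATUS_COMPLETED_COUNT", "sum"),
    STATUS_COMPLETED_MEAN=("STATUS_COMPLETED_MEAN", "mean"),
)

result = apps_toy.merge(per_client, on="SK_ID_CURR", how="left")
new_cols = ["STATUS_COMPLETED_COUNT", "STATUS_COMPLETED_MEAN"]
result[new_cols] = result[new_cols].fillna(0)
print(result)
```

The output has exactly as many rows as `apps_toy`, and applicants without any balance history end up with zeros in both engineered columns.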
def test_driver_prevAppsFeaturesAggregater(df, features):
    print("Executing the test driver............")
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n")
    display(df[features].head(5))
    print("---- Testing with `make_pipeline`---------")
    from sklearn.pipeline import make_pipeline
    test_pipeline = make_pipeline(BureauBalanceFeaturesAgg(bb, bur))
    return(test_pipeline.fit_transform(df))
# All features of previous applications .....
features = ['AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CNT_PAYMENT',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# Features of interest.....
import gc
gc.collect()
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print("\n\n----- Results ----------")
print(f"Test driver: \n")
display(res.head(10))
print(f"input[features][0:10]: \n")
display(appsDF.head(10))
class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None, prevApp=1): # no *args or **kargs
        self.prevApp=prevApp
        self.features = features
        self.agg_op_features = {}
        for f in features:
            self.agg_op_features[f]=[]
            self.agg_op_features[f].extend((f"{f}_{func}",func) for func in ["min", "max", "mean"])
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        ####################-- Python Debugging---################################
        # from IPython.core.debugger import Pdb as pdb
        # pdb().set_trace()
        # breakpoint dont forget to quit
        ###########################################################
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=["SK_ID_CURR"])
        if self.prevApp:
            result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
        return result
        # todo ---
        # return dataframe with the join key "SK_ID_CURR"
def test_driver_prevAppsFeaturesAggregater(df, features):
    print("Executing the test driver............")
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n")
    display(df[features].head(5))
    print("---- Testing with `make_pipeline`---------")
    from sklearn.pipeline import make_pipeline
    test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
    return(test_pipeline.fit_transform(df))
# All features of previous applications .....
features = ['AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CNT_PAYMENT',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# Features of interest.....
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print("\n\n----- Results ----------")
print(f"Test driver: \n")
display(res.head(10))
print(f"input[features][0:10]: \n")
display(appsDF.head(10))
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
from sklearn.base import BaseEstimator, TransformerMixin
import re
# Binary-encodes OCCUPATION_TYPE:
# 1.0 if the occupation is in the list below, 0.0 otherwise
class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
    def __init__(self, features="OCCUPATION_TYPE"): # no *args or **kargs
        self.features = features
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X, columns=self.features)
        #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
        df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)
        #df.drop(self.features, axis=1, inplace=True)
        return np.array(df.values) #return a Numpy Array to observe the pipeline protocol
from sklearn.pipeline import make_pipeline
features = ["OCCUPATION_TYPE"]
def test_driver_prep_OCCUPATION_TYPE():
    print(f"X_train.shape: {X_train.shape}\n")
    print(f"X_train[{features}][0:5]: \n{X_train[features][0:5]}")
    test_pipeline = make_pipeline(prep_OCCUPATION_TYPE(features))
    return(test_pipeline.fit_transform(X_train))
x = test_driver_prep_OCCUPATION_TYPE()
print(f"Test driver: \n{x[0:10, :]}")
print(f"X_train[{features}][0:10]: \n{X_train[features][0:10]}")
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
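One way to act on the question above is to compare lower-cased strings, so that a value spelled `Sales staff` still matches a list entry spelled `Sales Staff`. The sketch below uses an invented toy Series; the variable names (`white_collar`, `encoded`, `exact`) are assumptions for illustration, and the occupation list is copied from the transformer above.

```python
import pandas as pd

# Toy column with the lower-case "staff" spelling from the question (invented rows).
occupation = pd.Series(
    ["Sales staff", "Laborers", None, "IT staff", "Core staff"],
    name="OCCUPATION_TYPE",
)
white_collar = ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff',
                'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff']

# Case-insensitive match: lower-case both sides before comparing.
white_collar_lower = {name.lower() for name in white_collar}
encoded = occupation.str.lower().isin(white_collar_lower).astype(float)

# Exact match, for comparison: no toy row matches because of the capital "S".
exact = occupation.isin(white_collar).astype(float)

print(pd.DataFrame({"value": occupation, "encoded": encoded, "exact": exact}))
```

Missing occupations stay at `0.0` in both variants, since `isin` treats a missing value as not being in the list.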
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
                  '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
# Mean number of previous point of sale or cash loans where payment is past due date
agg_data = pcb.groupby(["SK_ID_CURR", "SK_ID_PREV"])["SK_DPD"].sum().reset_index()
agg_data.groupby(["SK_ID_CURR"])["SK_DPD"].mean().reset_index().rename({"SK_DPD": "SK_DPD_NORMALIZED_MEAN"},axis=1)
| | SK_ID_CURR | SK_DPD_NORMALIZED_MEAN |
|---|---|---|
| 0 | 100001 | 3.500000 |
| 1 | 100002 | 0.000000 |
| 2 | 100003 | 0.000000 |
| 3 | 100004 | 0.000000 |
| 4 | 100005 | 0.000000 |
| ... | ... | ... |
| 337247 | 456251 | 0.000000 |
| 337248 | 456252 | 0.000000 |
| 337249 | 456253 | 1.666667 |
| 337250 | 456254 | 0.000000 |
| 337251 | 456255 | 0.833333 |
337252 rows × 2 columns
agg_data = pcb.groupby(["SK_ID_CURR", "SK_ID_PREV"])["SK_DPD_DEF"].sum().reset_index()
agg_data.groupby(["SK_ID_CURR"])["SK_DPD_DEF"].max().reset_index().rename({"SK_DPD_DEF": "SK_DPD_DEF_NORMALIZED_MAX"},axis=1)
| | SK_ID_CURR | SK_DPD_DEF_NORMALIZED_MAX |
|---|---|---|
| 0 | 100001 | 7 |
| 1 | 100002 | 0 |
| 2 | 100003 | 0 |
| 3 | 100004 | 0 |
| 4 | 100005 | 0 |
| ... | ... | ... |
| 337247 | 456251 | 0 |
| 337248 | 456252 | 0 |
| 337249 | 456253 | 5 |
| 337250 | 456254 | 0 |
| 337251 | 456255 | 5 |
337252 rows × 2 columns
# Sum of (due day - payment day) per previous loan; positive values mean early payment
difference_days = ip.assign(PAYMENT_DIFFERENCE_DAYS=ip["DAYS_INSTALMENT"] - ip["DAYS_ENTRY_PAYMENT"]).groupby(["SK_ID_PREV","SK_ID_CURR"])["PAYMENT_DIFFERENCE_DAYS"].sum().reset_index()
difference_days
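The difference `DAYS_INSTALMENT - DAYS_ENTRY_PAYMENT` is positive when an installment was paid before its due day and negative when it was paid after. Besides the per-loan sum, a count of late installments may be a useful companion feature. The sketch below shows both on an invented toy table; `ip_toy` and the column name `LATE_PAYMENT_COUNT` are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Toy stand-in for installments_payments (invented values).
ip_toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2],
    "SK_ID_CURR": [10, 10, 20],
    "DAYS_INSTALMENT": [-100, -70, -50],
    "DAYS_ENTRY_PAYMENT": [-105, -60, -50],
})

# Positive = paid early, negative = paid late, zero = paid on the due day.
diff = (ip_toy["DAYS_INSTALMENT"] - ip_toy["DAYS_ENTRY_PAYMENT"]).rename("PAYMENT_DIFFERENCE_DAYS")
keys = [ip_toy["SK_ID_PREV"], ip_toy["SK_ID_CURR"]]

# Net difference per loan.
difference_days_toy = diff.groupby(keys).sum().reset_index()

# Number of installments per loan that were paid after the due day.
late_count = diff.lt(0).groupby(keys).sum().rename("LATE_PAYMENT_COUNT").reset_index()

print(difference_days_toy)
print(late_count)
```

A net sum can hide late payments when early ones cancel them out (loan 1 in the toy data nets to -5 from +5 and -10), which is the reason for keeping the count as a separate column.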
class FeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, ds, features, groupby_col, agg_previous_features=False):
        self.dataset = ds
        self.features = features
        self.groupby = groupby_col
        self.agg_op_features = {}
        self.agg_previous_features = agg_previous_features
        for f in features:
            self.agg_op_features[f]=[]
            self.agg_op_features[f].extend((f"{f}_{func}",func) for func in ["min", "max", "mean"])
        print("Called Basic Feature Aggregator")
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # X is application table
        _agg_dataset = self.dataset.groupby([self.groupby]).agg(self.agg_op_features)
        _agg_dataset.columns = _agg_dataset.columns.droplevel()
        _agg_dataset = _agg_dataset.reset_index(level=[self.groupby])
        _agg_dataset.fillna(0, inplace=True)
        if self.agg_previous_features:
            result = X.merge(_agg_dataset, on="SK_ID_CURR", how="left")
            del _agg_dataset
            gc.collect()
            return result
        result = _agg_dataset
        # new_features = [y[0] for x in self.agg_op_features.values() for y in x]
        # result = X.merge(_agg_dataset, on="SK_ID_CURR", how="left")
        # result[new_features].fillna(0, inplace=True)
        return result
class BureauFeaturesAgg(BaseEstimator, TransformerMixin):
def __init__(self): # no *args or **kargs
print("Called Feature Aggregator for Datasets : `Bureau`")
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
self.bur = X
# Total number of Past loans ?, Dataset: Bureau
past_loans = self.bur[["SK_ID_CURR", "CREDIT_ACTIVE"]].groupby("SK_ID_CURR").count().reset_index().rename(columns = {'CREDIT_ACTIVE':'TOTAL_PAST_LOANS'})
# Total types of loan, Dataset: Bureau
types_of_loan = self.bur[["SK_ID_CURR", "CREDIT_TYPE"]].groupby("SK_ID_CURR").nunique().reset_index().rename(columns = {'CREDIT_TYPE':'TOTAL_TYPES_OF_LOAN'})
# % of active loans, Dataset: Bureau
self.bur["is_credit_active"]= self.bur[["CREDIT_ACTIVE"]].apply(func= lambda x: False if x.CREDIT_ACTIVE=="Closed" else True, axis=1)
active_loans_mean = self.bur[["SK_ID_CURR", "is_credit_active"]].groupby("SK_ID_CURR").mean().reset_index().rename(columns = {'is_credit_active':'ACTIVE_LOANS_MEAN'})
# average of (days to credit end) for active credit. , Dataset: Bureau
with_active_credits = self.bur[self.bur["is_credit_active"]]
# if len(with_active_credits):
days_to_credit_end_mean = with_active_credits[["SK_ID_CURR", "DAYS_CREDIT_ENDDATE"]].groupby("SK_ID_CURR").mean().reset_index().rename(columns={'DAYS_CREDIT_ENDDATE': 'DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN'})
days_to_credit_end_mean["DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN"] = days_to_credit_end_mean["DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN"].apply(lambda x: 0 if x/365 < 0 else x/365)
# mean amount of prolonged credits
_max_overdue = self.bur[~self.bur["AMT_CREDIT_MAX_OVERDUE"].isna()]
max_overdue = _max_overdue.groupby("SK_ID_CURR")["AMT_CREDIT_MAX_OVERDUE"].mean().reset_index().rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN'})
# % of utilized debt??
def mean_debt(*arg):
return arg[0].AMT_CREDIT_SUM_DEBT/ (arg[0].AMT_CREDIT_SUM - arg[0].AMT_CREDIT_SUM_OVERDUE)
_max_overdue_active = with_active_credits[~self.bur["AMT_CREDIT_MAX_OVERDUE"].isna()]
r_cols = ["AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE"]
max_overdue_active = _max_overdue_active.groupby("SK_ID_CURR")[r_cols].sum().reset_index()
max_overdue_active["UTILIZED_DEBT"] = max_overdue_active.apply(mean_debt, axis=1)
max_overdue_active = max_overdue_active.drop(r_cols, axis=1)
# _result_1 = X.merge(past_loans, on="SK_ID_CURR", how="left")
_result_2 = past_loans.merge(types_of_loan, on="SK_ID_CURR", how="left")
_result_3 = _result_2.merge(active_loans_mean, on="SK_ID_CURR", how="left")
_result_4 = _result_3.merge(days_to_credit_end_mean, on="SK_ID_CURR", how="left")
_result_5 = _result_4.merge(max_overdue, on="SK_ID_CURR", how="left")
result = _result_5.merge(max_overdue_active, on="SK_ID_CURR", how="left")
result.drop_duplicates(inplace=True)
del types_of_loan,active_loans_mean, days_to_credit_end_mean, max_overdue, max_overdue_active
gc.collect()
new_cols = ["TOTAL_PAST_LOANS", "TOTAL_TYPES_OF_LOAN", "ACTIVE_LOANS_MEAN", "DAYS_CREDIT_ENDDATE_NORMALIZED_MEAN", "AMT_CREDIT_MAX_OVERDUE_NORMALIZED_MEAN", "UTILIZED_DEBT"]
# Assign back: fillna(inplace=True) on a column selection only changes a temporary copy
result[new_cols] = result[new_cols].fillna(0)
return result
class BureauBalanceFeaturesAgg(BaseEstimator, TransformerMixin):
def __init__(self, bur_dataset):
print("Called Feature Aggregator for Datasets : `Bureau Balance`")
self.bur = bur_dataset
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
########################################################################
_bureau_data = self.bur[["SK_ID_CURR", "SK_ID_BUREAU"]]
all_counts = X.groupby("SK_ID_BUREAU")["STATUS"].count()
# Count of completed payments
_completed_status_count = X[(X.STATUS =="C") | (X.STATUS =="0")].groupby("SK_ID_BUREAU")["STATUS"].count()
completed_status = _completed_status_count.reset_index().rename(columns={"STATUS": "STATUS_COMPLETED_COUNT"}).fillna(0)
_result_1 = _bureau_data.merge(completed_status, how="left", on="SK_ID_BUREAU")
del completed_status
gc.collect()
# mean of closed or completed records
_mean_completed_status = _completed_status_count / all_counts
mean_completed_status = _mean_completed_status.reset_index().rename(columns={"STATUS": "STATUS_COMPLETED_MEAN"}).fillna(0)
_result_2 = _result_1.merge(mean_completed_status, how="left", on="SK_ID_BUREAU")
del _result_1, mean_completed_status, _mean_completed_status, _completed_status_count
gc.collect()
# count of records which are past due date
cond = (X.STATUS == "3") | (X.STATUS == "4") | (X.STATUS == "5")
_past_due_status_count = X[cond].groupby("SK_ID_BUREAU")["STATUS"].count()
past_due_status_count = _past_due_status_count.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_COUNT"})
_result_3 = _result_2.merge(past_due_status_count, how="left", on="SK_ID_BUREAU")
del _result_2, _past_due_status_count, cond
gc.collect()
# Memory Overflow
# % of records past the due date
# _mean_past_status = past_due_status_count/ all_counts
# mean_past_status = _mean_past_status.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_MEAN"}).fillna(0)
# _result_4 = _result_3.merge(mean_past_status, how="left", on="SK_ID_BUREAU")
# del _result_3, _mean_past_status, past_due_status_count, mean_past_status
# gc.collect()
# # % of records where status is unknown
# _status_unknown = self.bb[self.bb.STATUS == "X"].groupby("SK_ID_BUREAU")["STATUS"].count() / all_counts
# status_unknown = _status_unknown.reset_index().rename(columns={"STATUS": "STATUS_PAST_DUE_UNKNOWN_MEAN"}).fillna(0)
# _result_5 = _result_4.merge(status_unknown, how="left", on="SK_ID_BUREAU")
# del _result_4, _status_unknown, status_unknown
# gc.collect()
new_cols = ["STATUS_COMPLETED_COUNT", "STATUS_COMPLETED_MEAN", "STATUS_PAST_DUE_COUNT"]
result = _result_3.groupby("SK_ID_CURR")[new_cols].sum().reset_index()
result.drop_duplicates(inplace=True)
# Merge with original table
# result = X.merge(_result_6, on="SK_ID_CURR", how="left")
# Not Included: "STATUS_PAST_DUE_MEAN", "STATUS_PAST_DUE_UNKNOWN_MEAN", "STATUS_PAST_DUE_MEAN"
# Assign back: fillna(inplace=True) on a column selection only changes a temporary copy
result[new_cols] = result[new_cols].fillna(0)
return result
class CreditCardBalanceFeaturesAgg(BaseEstimator, TransformerMixin):
def __init__(self):
print("Called Feature Aggregator for Datasets : `Credit Card Balance`")
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
########################################################################
# Total number of Past loans ?, Dataset: CCB
past_loans = X.groupby(by = ['SK_ID_CURR'])['SK_ID_PREV'].nunique().reset_index().rename(columns = {'SK_ID_PREV': 'TOTAL_CREDIT_LOANS'})
# Maximum number of installments per loan
max_no_credit_install = X.groupby(by =['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(columns = {'CNT_INSTALMENT_MATURE_CUM': 'MAX_NO_CREDIT_INSTALMENTS'})
_result_1 = max_no_credit_install.merge(past_loans, on="SK_ID_CURR", how="left")
# Total number of installments per loan
total_credit_installments = max_no_credit_install.groupby(["SK_ID_CURR"]).sum().MAX_NO_CREDIT_INSTALMENTS.reset_index().rename(columns = {'MAX_NO_CREDIT_INSTALMENTS': 'TOTAL_CREDIT_INSTALLMENTS'})
_result_2 = _result_1.merge(total_credit_installments, on="SK_ID_CURR", how="left")
# Mean number of installments per loan
_result_2['INSTALLMENTS_PER_LOAN'] = (_result_2['TOTAL_CREDIT_INSTALLMENTS']/_result_2['TOTAL_CREDIT_LOANS']).astype('int')
del past_loans, max_no_credit_install, total_credit_installments, _result_1
gc.collect()
# total & mean installments past due date
def past_due_date_count(*arg):
return len(list(filter(lambda x: x!=0, arg[0].SK_DPD.values)))
total_past_due_date = X.groupby(["SK_ID_CURR","SK_ID_PREV"]).apply(past_due_date_count).reset_index().rename({0:"TOTAL_PAST_DUE_DATE"},axis=1)
mean_past_due_date = total_past_due_date.groupby(["SK_ID_CURR","SK_ID_PREV"]).TOTAL_PAST_DUE_DATE.mean().reset_index().rename({"TOTAL_PAST_DUE_DATE":"MEAN_PAST_DUE_DATE"},axis=1)
grouped_data = X.groupby(["SK_ID_CURR", "SK_ID_PREV"])
r_cols = ["AMT_DRAWINGS_ATM_CURRENT","AMT_DRAWINGS_CURRENT","AMT_DRAWINGS_OTHER_CURRENT","AMT_DRAWINGS_POS_CURRENT"]
mean_spending_data = grouped_data[r_cols].mean().fillna(0).reset_index().rename({ _ : _ +"_MEAN" for _ in r_cols}, axis=1)
_result_3 = total_past_due_date.merge(mean_past_due_date, on=["SK_ID_CURR","SK_ID_PREV"], how="left")
_result_4 = _result_3.merge(mean_spending_data, on=["SK_ID_CURR","SK_ID_PREV"], how="left")
_result_5 = _result_2.merge(_result_4, on=["SK_ID_CURR","SK_ID_PREV"], how="left")
_result_5.fillna(0, inplace=True)
_result_5.drop_duplicates(inplace=True)
result = _result_5.groupby(["SK_ID_CURR"]).sum().drop("SK_ID_PREV", axis=1).reset_index()
del total_past_due_date, mean_past_due_date, _result_2, _result_3
gc.collect()
return result
class POSCashBalanceFeaturesAgg(BaseEstimator, TransformerMixin):
def __init__(self):
print("Called Feature Aggregator for Datasets : `POS Cash Balance`")
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
# Mean number of previous point of sale or cash loans where payment is past due date
_sk_dpd_mean = X.groupby(["SK_ID_CURR", "SK_ID_PREV"])["SK_DPD"].sum().reset_index()
sk_dpd_mean = _sk_dpd_mean.groupby(["SK_ID_CURR"])["SK_DPD"].mean().reset_index().rename({"SK_DPD": "SK_DPD_NORMALIZED_MEAN"},axis=1)
_sk_dpd_def_max = X.groupby(["SK_ID_CURR", "SK_ID_PREV"])["SK_DPD_DEF"].sum().reset_index()
sk_dpd_def_max = _sk_dpd_def_max.groupby(["SK_ID_CURR"])["SK_DPD_DEF"].max().reset_index().rename({"SK_DPD_DEF": "SK_DPD_DEF_NORMALIZED_MAX"},axis=1)
result = sk_dpd_mean.merge(sk_dpd_def_max, on="SK_ID_CURR", how="left")
result.fillna(0, inplace=True)
result.drop_duplicates(inplace=True)
del _sk_dpd_def_max, sk_dpd_def_max, sk_dpd_mean, _sk_dpd_mean
gc.collect()
return result
class Agg_Secondary_table(object):
@classmethod
def transform(cls, all_tables, X=None, merge_all_data=True):
if X is None:
X = all_tables.get("application_train", None)
if X is None:
raise ValueError("Please provide either train or test dataset")
print("-+-+-"*10)
print("Using Application Train data.....")
print("-+-+-"*10)
pa = all_tables["previous_application"]
ip = all_tables["installments_payments"]
pcb = all_tables["POS_CASH_balance"]
ccb = all_tables["credit_card_balance"]
bur = all_tables["bureau"]
bb = all_tables["bureau_balance"]
# Define all necessary features........
pa_features = ['AMT_ANNUITY', 'AMT_APPLICATION']
bureau_features = [
'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE',
'AMT_ANNUITY'
]
ccb_features = ['AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM']
# ip_features = ['AMT_INSTALMENT', 'AMT_PAYMENT']
# Pipeline starts.....
bureau_feature_pipeline = Pipeline([
("bureau_new_features", BureauFeaturesAgg()),
(
'feature_aggregater',
FeaturesAggregater(bur, bureau_features, "SK_ID_CURR", True)
),
])
bb_feature_pipeline = Pipeline([
("bureau_balance_new_features", BureauBalanceFeaturesAgg(bur))
])
prevApps_feature_pipeline = Pipeline([
(
'prevApps_aggregater',
FeaturesAggregater(pa, pa_features, "SK_ID_CURR", False)
),
])
ccb_feature_pipeline = Pipeline([
('credit_card_balance_new_features', CreditCardBalanceFeaturesAgg()),
(
'feature_aggregater',
FeaturesAggregater(ccb, ccb_features, "SK_ID_CURR", True)
),
])
pcb_feature_pipeline = Pipeline([
('POS_cash_balance_new_features', POSCashBalanceFeaturesAgg()),
])
# ip_feature_pipeline = Pipeline([
# # ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# # ('prevApps_add_features2', prevApps_add_features2()), # add some new features
# ('feature_aggregater', prevAppsFeaturesAggregater(ip_features,prevApp=0)), # Aggregate across old and new features
# ])
if merge_all_data:
prevApps_aggregated = prevApps_feature_pipeline.transform(pa)
bureau_aggregated = bureau_feature_pipeline.transform(bur)
bb_aggregated = bb_feature_pipeline.transform(bb)
ccb_aggregated = ccb_feature_pipeline.transform(ccb)
pcb_aggregated = pcb_feature_pipeline.transform(pcb)
# Merge the primary table with the secondary tables using features based on metadata
# and aggregate stats
print("Original Data Details (Rows, Columns): ", X.shape)
# 1. Join/Merge in bureau Data
X = X.merge(bureau_aggregated, how='left', on='SK_ID_CURR')
print("After Adding New Features Data Details (Rows, Columns): ", X.shape)
# 2. Join/Merge in previous_application Data
X = X.merge(prevApps_aggregated, how='left', on="SK_ID_CURR")
print("After Adding New Features Data Details (Rows, Columns): ", X.shape)
# 3. Join/Merge in Aggregated POS_CASH_balance Data
X = X.merge(pcb_aggregated, how='left', on="SK_ID_CURR")
print("After Adding New Features Data Details (Rows, Columns): ", X.shape)
# 4. Join/Merge in bureau_balance Data
X = X.merge(bb_aggregated, how='left', on="SK_ID_CURR")
print("After Adding New Features Data Details (Rows, Columns): ", X.shape)
# 5. Join/Merge in Aggregated credit_card_balance Data
X = X.merge(ccb_aggregated, how='left', on="SK_ID_CURR")
print("After Adding New Features Data Details (Rows, Columns): ", X.shape)
del bureau_aggregated,bb_aggregated, prevApps_aggregated, ccb_aggregated, pcb_aggregated
gc.collect()
print("-+-+-"*10)
print("Aggregated data .....")
display(X.tail(5))
return X
class DataFrameSelector(BaseEstimator, TransformerMixin):
"""
Create a class to select numerical or categorical columns
since Scikit-Learn doesn't handle DataFrames yet.
"""
def __init__(self, attribute_names): self.attribute_names = attribute_names
def fit(self, X, y=None): return self
def transform(self, X): return X[self.attribute_names].values
class Estimatorstub(object):
"""
Placeholder estimator; the parameter grid replaces it with a real classifier.
"""
def fit(self, X, y=None): return self
def transform(self, X, y=None): return self
class FeatureSelectionstub(object):
"""
Placeholder feature selector; the parameter grid replaces it with a real selector.
"""
def fit(self, X, y=None): return self
def transform(self, X, y=None): return self
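To illustrate what the `DataFrameSelector` above hands to the downstream steps, here is a minimal sketch on a made-up three-row frame. The `toy` frame, its values, and the `toy_num_pipeline` name are assumptions for illustration only, and the class is re-declared so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical toy data; the values are not taken from the HCDR tables.
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, np.nan, 300.0],
    "CODE_GENDER": ["F", "M", "F"],
})

# Select one numeric column, then fill the gap with the column mean.
toy_num_pipeline = Pipeline([
    ("selector", DataFrameSelector(["AMT_CREDIT"])),
    ("imputer", SimpleImputer(strategy="mean")),
])

selected = toy_num_pipeline.fit_transform(toy)
print(type(selected).__name__, selected.shape)
```

Because the selector returns a plain array, the later steps never see column names; that is why the numeric and categorical column lists are kept as separate Python lists.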
# Identify the numeric features we wish to consider.
num_attribs = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT',
'DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
'EXT_SOURCE_2','EXT_SOURCE_3'
]
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('numeric_imputer', SimpleImputer(strategy='mean')),
("scaling", StandardScaler())
])
# Identify the categorical features we wish to consider.
cat_attribs = [
"NAME_CONTRACT_TYPE",
"NAME_TYPE_SUITE",
"NAME_INCOME_TYPE",
"NAME_EDUCATION_TYPE",
"NAME_FAMILY_STATUS",
"NAME_HOUSING_TYPE",
"OCCUPATION_TYPE",
"WEEKDAY_APPR_PROCESS_START",
"HOUR_APPR_PROCESS_START",
"ORGANIZATION_TYPE",
"CODE_GENDER",
"FLAG_OWN_CAR",
"FLAG_OWN_REALTY"
]
# Note handle_unknown="ignore" in the OHE: categories in the validation/test data that
# do NOT occur in the training set are encoded as all zeros instead of raising an error
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
#('imputer', SimpleImputer(strategy='most_frequent')),
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
set_config(display="diagram")
data_prep_pipeline
FeatureUnion(transformer_list=[('num_pipeline',
Pipeline(steps=[('selector',
DataFrameSelector(attribute_names=['AMT_INCOME_TOTAL',
'AMT_CREDIT',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'EXT_SOURCE_1',
'EXT_SOURCE_2',
'EXT_SOURCE_3'])),
('numeric_imputer',
SimpleImputer()),
('scaling',
StandardScaler())])),
('cat_pipeline',
Pipeline(steps=[('selector',
DataFrameSelector(a...
'NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS',
'NAME_HOUSING_TYPE',
'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START',
'HOUR_APPR_PROCESS_START',
'ORGANIZATION_TYPE',
'CODE_GENDER',
'FLAG_OWN_CAR',
'FLAG_OWN_REALTY'])),
('imputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('ohe',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]))])
X_train, X_valid, X_test, y_train, y_valid, y_test = load_train_valid_test_data(list_of_features=None)
X_train_agg = Agg_Secondary_table.transform(datasets, X=X_train)
X_valid_agg = Agg_Secondary_table.transform(datasets, X=X_valid)
X_test_agg = Agg_Secondary_table.transform(datasets, X=X_test)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Using Application data with selected features ...vvv
['SK_ID_CURR', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE', 'NAME_TYPE_SUITE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'ORGANIZATION_TYPE']
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
-------------------------------------------------
X train shape: (222176, 21)
X validation shape: (46127, 21)
X test shape: (39208, 21)
X X_kaggle_test shape: (48744, 21)
Y train shape: (222176,)
Y validation shape: (46127,)
Y test shape: (39208,)
Called Feature Aggregator for Datasets : `Bureau`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Bureau Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Credit Card Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `POS Cash Balance`
Original Data Details (Rows, Columns): (222176, 21)
After Adding New Features Data Details (Rows, Columns): (222176, 42)
After Adding New Features Data Details (Rows, Columns): (222176, 48)
After Adding New Features Data Details (Rows, Columns): (222176, 50)
After Adding New Features Data Details (Rows, Columns): (222176, 53)
After Adding New Features Data Details (Rows, Columns): (222176, 69)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Aggregated data .....
| | SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | ... | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN | AMT_BALANCE_min | AMT_BALANCE_max | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 222171 | 267117 | 270000.0 | 1762110.0 | -7218 | -23554 | 0.748672 | 0.679988 | 0.553165 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222172 | 138205 | 112500.0 | 284400.0 | -382 | -9958 | 0.297779 | 0.394895 | NaN | M | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222173 | 204966 | 45000.0 | 180000.0 | -4429 | -12008 | NaN | 0.671937 | 0.273565 | F | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222174 | 385249 | 202500.0 | 1736937.0 | -573 | -10209 | NaN | 0.086790 | 0.520898 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222175 | 345838 | 58500.0 | 157500.0 | -2074 | -8751 | NaN | 0.363715 | 0.368969 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 69 columns
Called Feature Aggregator for Datasets : `Bureau`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Bureau Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Credit Card Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `POS Cash Balance`
Original Data Details (Rows, Columns): (46127, 21)
After Adding New Features Data Details (Rows, Columns): (46127, 42)
After Adding New Features Data Details (Rows, Columns): (46127, 48)
After Adding New Features Data Details (Rows, Columns): (46127, 50)
After Adding New Features Data Details (Rows, Columns): (46127, 53)
After Adding New Features Data Details (Rows, Columns): (46127, 69)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Aggregated data .....
| | SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | ... | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN | AMT_BALANCE_min | AMT_BALANCE_max | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46122 | 313895 | 135000.0 | 266832.0 | -126 | -9621 | NaN | 0.654572 | 0.404878 | M | Y | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 46123 | 423071 | 315000.0 | 629325.0 | -4034 | -17995 | NaN | 0.701676 | 0.581484 | M | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46124 | 217139 | 67500.0 | 127350.0 | -422 | -16085 | NaN | 0.677566 | 0.377404 | F | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46125 | 161369 | 135000.0 | 168102.0 | -714 | -7736 | NaN | 0.175819 | NaN | M | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46126 | 385178 | 225000.0 | 912240.0 | -2058 | -13100 | 0.848648 | 0.753290 | 0.719491 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 69 columns
Called Feature Aggregator for Datasets : `Bureau`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Bureau Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Credit Card Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `POS Cash Balance`
Original Data Details (Rows, Columns): (39208, 21)
After Adding New Features Data Details (Rows, Columns): (39208, 42)
After Adding New Features Data Details (Rows, Columns): (39208, 48)
After Adding New Features Data Details (Rows, Columns): (39208, 50)
After Adding New Features Data Details (Rows, Columns): (39208, 53)
After Adding New Features Data Details (Rows, Columns): (39208, 69)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Aggregated data .....
| | SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | ... | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN | AMT_BALANCE_min | AMT_BALANCE_max | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39203 | 295615 | 67500.0 | 73944.0 | -1800 | -16215 | 0.586475 | 0.468402 | 0.506484 | F | Y | ... | 0.0 | 17637.895 | 0.0 | 19842.631875 | 0.0 | 137628.045 | 76754.74 | 0.0 | 7.0 | 3.111111 |
| 39204 | 295863 | 90000.0 | 239850.0 | 365243 | -20309 | NaN | 0.645764 | 0.362277 | F | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39205 | 123795 | 252000.0 | 780363.0 | -937 | -13572 | 0.305887 | 0.509765 | NaN | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39206 | 100669 | 81000.0 | 670500.0 | 365243 | -22990 | 0.527931 | 0.679812 | 0.511892 | M | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39207 | 453575 | 315000.0 | 900000.0 | -1444 | -15197 | NaN | 0.650462 | 0.215182 | M | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 69 columns
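The NaN entries in the rightmost columns above are what a left merge produces for applicants that have no rows in a secondary table. A minimal sketch, assuming a made-up application frame and credit-card frame (`app`, `ccb_toy`, and their values are hypothetical):

```python
import pandas as pd

# Hypothetical primary table: one row per applicant.
app = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [100.0, 200.0, 300.0],
})

# Hypothetical secondary table: several rows per applicant, none for applicant 2.
ccb_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 3],
    "AMT_BALANCE": [10.0, 30.0, 50.0],
})

# Aggregate to one row per applicant before merging.
agg = ccb_toy.groupby("SK_ID_CURR")["AMT_BALANCE"].agg(["min", "max", "mean"])
agg.columns = [f"AMT_BALANCE_{name}" for name in agg.columns]
agg = agg.reset_index()

# validate="one_to_one" raises if either side has duplicate keys,
# which guards against the merge silently adding rows.
merged = app.merge(agg, how="left", on="SK_ID_CURR", validate="one_to_one")
print(merged)
```

The row count stays equal to that of the primary table, matching the unchanged row counts in the log above, and the missing aggregates are left for the imputation step to handle.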
##### Testing
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("feature_selection", FeatureSelectionstub()),
("clf", Estimatorstub()),
])
# Todo- can we try different imputation and scaling in param grid??
param_grid = [
{
"preparation__num_pipeline__scaling": [
'passthrough',
Normalizer(norm="l2"),
Normalizer(norm="l1"),
Normalizer(norm="max"),
StandardScaler(),
],
"feature_selection": (SelectKBest(),),
"feature_selection__k": [15, 20, 25, 30],
# "feature_selection__score_func": [chi2,r_regression],
'clf': (LogisticRegression(),),
"clf__penalty": ["l1", "l2", "elasticnet"],
"clf__l1_ratio": [0.25, 0.5,0.75]
}
]
# RandomizedSearchCV
# GridSearchCV
gsv = RandomizedSearchCV(
full_pipeline_with_predictor, param_grid,
cv=3, n_jobs=-1, verbose=2, return_train_score=True, scoring="roc_auc"
)
model = gsv.fit(X_train_agg, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
print("The best roc_auc_score is: {}".format(model.best_score_))
print("------ The best parameters are: {}".format(model.best_params_))
print("The accuracy score of this model is:{}".format(np.round(accuracy_score(y_train, model.predict(X_train_agg)), 3)))
print("\n\nGrid search Results:-----")
The best roc_auc_score is: 0.7352625429277486
------ The best parameters are: {'preparation__num_pipeline__scaling': StandardScaler(), 'feature_selection__k': 30, 'feature_selection': SelectKBest(k=30), 'clf__penalty': 'l2', 'clf__l1_ratio': 0.75, 'clf': LogisticRegression(l1_ratio=0.75)}
The accuracy score of this model is:0.92
Grid search Results:-----
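The output above reports accuracy alongside ROC AUC. When one class is rare, accuracy alone can look strong even for a predictor with no ranking ability, which is one reason to read the two numbers together. A minimal sketch on synthetic labels (the 10% positive rate is an arbitrary assumption for the demonstration, not a figure taken from this notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic target: 10 positives, 90 negatives.
y_true = np.array([1] * 10 + [0] * 90)

# A predictor that always answers "0" and gives every row the same score.
y_pred_constant = np.zeros_like(y_true)
scores_constant = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred_constant)
auc = roc_auc_score(y_true, scores_constant)
print(f"accuracy={acc:.2f}, roc_auc={auc:.2f}")
```

This does not say anything about the fitted model itself; it only shows why `scoring="roc_auc"` is a more informative selection criterion than accuracy on an imbalanced target.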
To get a baseline, we use a subset of the features after preprocessing them through the pipeline. The baseline model is a logistic regression.
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression())
])
model = full_pipeline_with_predictor.fit(X_train, y_train)
CPU times: user 6.06 s, sys: 2.22 s, total: 8.29 s Wall time: 4.56 s
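As a sketch of how such a baseline can then be scored, the snippet below fits the same kind of impute, scale, and logistic-regression pipeline on synthetic data from `make_classification` and reports ROC AUC on a held-out split. The data, split sizes, and variable names are assumptions; it does not reproduce this notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the application data.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

baseline = Pipeline([
    ("numeric_imputer", SimpleImputer(strategy="mean")),
    ("scaling", StandardScaler()),
    ("linear", LogisticRegression()),
])
baseline.fit(X_tr, y_tr)

# ROC AUC needs scores, so use the probability of the positive class.
valid_proba = baseline.predict_proba(X_va)
valid_auc = roc_auc_score(y_va, valid_proba[:, 1])
print("Validation ROC AUC:", round(valid_auc, 3))
```

Scoring on a split the model never saw during fitting is what makes the number comparable with the validation and test columns of an experiment log.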
%%time
expLog_columns = ["exp_name","Train Acc", "Valid Acc","Test Acc","Train AUC", "Valid AUC","Test AUC"]
X_train, X_valid, X_test, y_train, y_valid, y_test = load_train_valid_test_data(list_of_features=None)
X_train_agg = Agg_Secondary_table.transform(datasets, X=X_train)
X_valid_agg = Agg_Secondary_table.transform(datasets, X=X_valid)
X_test_agg = Agg_Secondary_table.transform(datasets, X=X_test)
print("-------------------------------------------\n\n Grid Search......\n\n")
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("feature_selection", FeatureSelectionstub()),
("clf", Estimatorstub()),
])
param_grid = [
{
"preparation__num_pipeline__scaling": [
'passthrough',
Normalizer(norm="max"),
Normalizer(norm="l2"),
StandardScaler()
],
"feature_selection": (SelectKBest(),),
"feature_selection__k": [30, 35, 40],
"feature_selection__score_func": [r_regression],
'clf': (LogisticRegression(),),
"clf__penalty": ["l1", "l2", "elasticnet"],
"clf__l1_ratio": [0.25, 0.5, 0.75]
},
{
"feature_selection": (SelectKBest(k="all"),),
"clf": (DecisionTreeClassifier(),),
"clf__criterion":['gini','entropy'],
"clf__max_depth":range(1, 11,2)
},
{
"feature_selection": (SelectKBest(k="all"),),
"clf": (RandomForestClassifier(),),
'clf__bootstrap': [True,False],
'clf__max_depth': [10, 20],
'clf__max_features': [2, 3],
'clf__n_estimators': [100, 200],
},
{
"feature_selection": (SelectKBest(k="all"),),
"clf": (XGBClassifier(),),
"clf__learning_rate" : [0.05, 0.10, 0.20],
"clf__max_depth" : [1,3,5],
"clf__min_child_weight" : [ 1, 3, 5],
}
]
# RandomizedSearchCV
gsv = GridSearchCV(
full_pipeline_with_predictor, param_grid,
cv=3, n_jobs=-1, verbose=2, return_train_score=True, scoring="roc_auc"
)
model = gsv.fit(X_train_agg, y_train)
print("The best roc_auc_score is: {}".format(model.best_score_))
print("------ The best parameters are: {}".format(model.best_params_))
print("The accuracy score of this model is:{}".format(np.round(accuracy_score(y_train, model.predict(X_train_agg)), 3)))
print("\n\nGrid search Results:-----")
display(pd.DataFrame(model.cv_results_).sort_values(by="rank_test_score"))
print("---------------------------------------------------------")
print("Experiment results so far......")
exp_name = f"Gridserach_baseline{len(X_train_agg.columns)}_features"
try:
expLog
except NameError:
expLog = pd.DataFrame(columns=expLog_columns)
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, model.predict(X_train_agg)),
accuracy_score(y_valid, model.predict(X_valid_agg)),
accuracy_score(y_test, model.predict(X_test_agg)),
roc_auc_score(y_train, model.predict_proba(X_train_agg)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid_agg)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test_agg)[:, 1])],
4))
display(expLog)
print("\n\n-----Historical Experiment Results......")
# Store the logs on disk in case of kernel failure....
historical_logs = os.path.join(DATA_DIR, "expLog.csv")
if os.path.exists(historical_logs):
old_explog = pd.read_csv(historical_logs)
df = pd.concat([old_explog, expLog])
df.drop_duplicates(inplace=True)
df.to_csv(historical_logs, index=False)
else:
expLog.to_csv(historical_logs, index=False)
display(pd.read_csv(historical_logs).sort_values(by="Test AUC"))
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Using Application data with selected features ...vvv
['SK_ID_CURR', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE', 'NAME_TYPE_SUITE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'ORGANIZATION_TYPE']
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
-------------------------------------------------
X train shape: (222176, 21)
X validation shape: (46127, 21)
X test shape: (39208, 21)
X X_kaggle_test shape: (48744, 21)
Y train shape: (222176,)
Y validation shape: (46127,)
Y test shape: (39208,)
Called Feature Aggregator for Datasets : `Bureau`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Bureau Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Credit Card Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `POS Cash Balance`
Original Data Details (Rows, Columns): (222176, 21)
After Adding New Features Data Details (Rows, Columns): (222176, 42)
After Adding New Features Data Details (Rows, Columns): (222176, 48)
After Adding New Features Data Details (Rows, Columns): (222176, 50)
After Adding New Features Data Details (Rows, Columns): (222176, 53)
After Adding New Features Data Details (Rows, Columns): (222176, 69)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Aggregated data .....
| | SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | ... | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN | AMT_BALANCE_min | AMT_BALANCE_max | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 222171 | 267117 | 270000.0 | 1762110.0 | -7218 | -23554 | 0.748672 | 0.679988 | 0.553165 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222172 | 138205 | 112500.0 | 284400.0 | -382 | -9958 | 0.297779 | 0.394895 | NaN | M | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222173 | 204966 | 45000.0 | 180000.0 | -4429 | -12008 | NaN | 0.671937 | 0.273565 | F | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222174 | 385249 | 202500.0 | 1736937.0 | -573 | -10209 | NaN | 0.086790 | 0.520898 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 222175 | 345838 | 58500.0 | 157500.0 | -2074 | -8751 | NaN | 0.363715 | 0.368969 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 69 columns
Called Feature Aggregator for Datasets : `Bureau`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Bureau Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Credit Card Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `POS Cash Balance`
Original Data Details (Rows, Columns): (46127, 21)
After Adding New Features Data Details (Rows, Columns): (46127, 42)
After Adding New Features Data Details (Rows, Columns): (46127, 48)
After Adding New Features Data Details (Rows, Columns): (46127, 50)
After Adding New Features Data Details (Rows, Columns): (46127, 53)
After Adding New Features Data Details (Rows, Columns): (46127, 69)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Aggregated data .....
| | SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | ... | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN | AMT_BALANCE_min | AMT_BALANCE_max | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46122 | 313895 | 135000.0 | 266832.0 | -126 | -9621 | NaN | 0.654572 | 0.404878 | M | Y | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 46123 | 423071 | 315000.0 | 629325.0 | -4034 | -17995 | NaN | 0.701676 | 0.581484 | M | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46124 | 217139 | 67500.0 | 127350.0 | -422 | -16085 | NaN | 0.677566 | 0.377404 | F | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46125 | 161369 | 135000.0 | 168102.0 | -714 | -7736 | NaN | 0.175819 | NaN | M | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46126 | 385178 | 225000.0 | 912240.0 | -2058 | -13100 | 0.848648 | 0.753290 | 0.719491 | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 69 columns
Called Feature Aggregator for Datasets : `Bureau`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Bureau Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `Credit Card Balance`
Called Basic Feature Aggregator
Called Feature Aggregator for Datasets : `POS Cash Balance`
Original Data Details (Rows, Columns): (39208, 21)
After Adding New Features Data Details (Rows, Columns): (39208, 42)
After Adding New Features Data Details (Rows, Columns): (39208, 48)
After Adding New Features Data Details (Rows, Columns): (39208, 50)
After Adding New Features Data Details (Rows, Columns): (39208, 53)
After Adding New Features Data Details (Rows, Columns): (39208, 69)
-+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+--+-+-
Aggregated data .....
| | SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | ... | AMT_DRAWINGS_ATM_CURRENT_MEAN | AMT_DRAWINGS_CURRENT_MEAN | AMT_DRAWINGS_OTHER_CURRENT_MEAN | AMT_DRAWINGS_POS_CURRENT_MEAN | AMT_BALANCE_min | AMT_BALANCE_max | AMT_BALANCE_mean | CNT_INSTALMENT_MATURE_CUM_min | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39203 | 295615 | 67500.0 | 73944.0 | -1800 | -16215 | 0.586475 | 0.468402 | 0.506484 | F | Y | ... | 0.0 | 17637.895 | 0.0 | 19842.631875 | 0.0 | 137628.045 | 76754.74 | 0.0 | 7.0 | 3.111111 |
| 39204 | 295863 | 90000.0 | 239850.0 | 365243 | -20309 | NaN | 0.645764 | 0.362277 | F | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39205 | 123795 | 252000.0 | 780363.0 | -937 | -13572 | 0.305887 | 0.509765 | NaN | F | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39206 | 100669 | 81000.0 | 670500.0 | 365243 | -22990 | 0.527931 | 0.679812 | 0.511892 | M | N | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39207 | 453575 | 315000.0 | 900000.0 | -1444 | -15197 | NaN | 0.650462 | 0.215182 | M | Y | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 69 columns
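
Column names such as `AMT_BALANCE_min`, `AMT_BALANCE_max`, and `AMT_BALANCE_mean` suggest that each secondary table is collapsed to one row per `SK_ID_CURR` and then left-joined onto the application data, which would also explain the `NaN` values for clients with no history in that table. The aggregator classes themselves are not shown in this output, so the following is only a minimal sketch under that assumption. The helper `aggregate_features` and the toy frames are hypothetical and are not the notebook's code.

```python
import pandas as pd

# Hypothetical miniature of a secondary table with several rows per client.
credit_card = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_BALANCE": [0.0, 100.0, 50.0, 150.0, 250.0],
})

# Hypothetical application table; client 3 has no credit-card history.
application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})


def aggregate_features(df, key, columns, funcs=("min", "max", "mean")):
    """Collapse a one-to-many table to one row per key.

    Output columns are named <column>_<func>, e.g. AMT_BALANCE_min.
    """
    agg = df.groupby(key)[columns].agg(list(funcs))
    # groupby(...).agg([...]) yields (column, func) pairs; flatten them.
    agg.columns = [f"{col}_{func}" for col, func in agg.columns]
    return agg.reset_index()


cc_features = aggregate_features(credit_card, "SK_ID_CURR", ["AMT_BALANCE"])

# A left join keeps every applicant; clients absent from the secondary
# table receive NaN in the aggregated columns.
merged = application.merge(cc_features, on="SK_ID_CURR", how="left")
print(merged)
```

With a left join the row count of the application table is preserved, which is consistent with the logs above, where only the column count grows from 21 to 69.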
-------------------------------------------
Grid Search......
Fitting 3 folds for each of 161 candidates, totalling 483 fits
The best roc_auc_score is: 0.6391650318473275
------ The best parameters are: {'clf': LogisticRegression(l1_ratio=0.25), 'clf__l1_ratio': 0.25, 'clf__penalty': 'l2', 'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>), 'feature_selection__k': 40, 'feature_selection__score_func': <function r_regression at 0x7f3f1fdfd290>, 'preparation__num_pipeline__scaling': StandardScaler()}
The accuracy score of this model is:0.92
Grid search Results:-----
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf | param_clf__l1_ratio | param_clf__penalty | param_feature_selection | param_feature_selection__k | param_feature_selection__score_func | ... | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 6.173357 | 0.139955 | 0.801307 | 0.049069 | LogisticRegression(l1_ratio=0.25) | 0.5 | l2 | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | 0.635750 | 0.641182 | 0.639165 | 0.002428 | 1 | 0.640139 | 0.642636 | 0.639663 | 0.640813 | 0.001304 |
| 23 | 6.238872 | 0.080555 | 0.767025 | 0.018354 | LogisticRegression(l1_ratio=0.25) | 0.25 | l2 | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | 0.635750 | 0.641182 | 0.639165 | 0.002428 | 1 | 0.640139 | 0.642636 | 0.639663 | 0.640813 | 0.001304 |
| 95 | 6.258930 | 0.178602 | 0.885521 | 0.064356 | LogisticRegression(l1_ratio=0.25) | 0.75 | l2 | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | 0.635750 | 0.641182 | 0.639165 | 0.002428 | 1 | 0.640139 | 0.642636 | 0.639663 | 0.640813 | 0.001304 |
| 91 | 6.776199 | 0.025043 | 0.773214 | 0.015896 | LogisticRegression(l1_ratio=0.25) | 0.75 | l2 | SelectKBest(k=40, score_func=<function r_regre... | 35 | <function r_regression at 0x7f3f1fdfd290> | ... | 0.634751 | 0.641233 | 0.638790 | 0.002877 | 4 | 0.639789 | 0.642011 | 0.639359 | 0.640387 | 0.001162 |
| 55 | 6.848706 | 0.166420 | 0.782969 | 0.018572 | LogisticRegression(l1_ratio=0.25) | 0.5 | l2 | SelectKBest(k=40, score_func=<function r_regre... | 35 | <function r_regression at 0x7f3f1fdfd290> | ... | 0.634751 | 0.641233 | 0.638790 | 0.002877 | 4 | 0.639789 | 0.642011 | 0.639359 | 0.640387 | 0.001162 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46 | 2.269154 | 0.314613 | 0.000000 | 0.000000 | LogisticRegression(l1_ratio=0.25) | 0.5 | l1 | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | NaN | NaN | NaN | NaN | 157 | NaN | NaN | NaN | NaN | NaN |
| 45 | 2.024679 | 0.070445 | 0.000000 | 0.000000 | LogisticRegression(l1_ratio=0.25) | 0.5 | l1 | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | NaN | NaN | NaN | NaN | 158 | NaN | NaN | NaN | NaN | NaN |
| 44 | 2.128255 | 0.120650 | 0.000000 | 0.000000 | LogisticRegression(l1_ratio=0.25) | 0.5 | l1 | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | NaN | NaN | NaN | NaN | 159 | NaN | NaN | NaN | NaN | NaN |
| 69 | 2.086492 | 0.069014 | 0.000000 | 0.000000 | LogisticRegression(l1_ratio=0.25) | 0.5 | elasticnet | SelectKBest(k=40, score_func=<function r_regre... | 40 | <function r_regression at 0x7f3f1fdfd290> | ... | NaN | NaN | NaN | NaN | 160 | NaN | NaN | NaN | NaN | NaN |
| 160 | 1.754747 | 0.116799 | 0.000000 | 0.000000 | XGBClassifier() | NaN | NaN | SelectKBest(score_func='all') | NaN | NaN | ... | NaN | NaN | NaN | NaN | 161 | NaN | NaN | NaN | NaN | NaN |
161 rows × 30 columns
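
Many candidates in the table above have `NaN` scores and a `mean_score_time` of 0, which is what `GridSearchCV` records when a fit raises an exception and `error_score` is left at its default. A short sketch of how such rows could be separated from the successful ones follows. The `cv_results` dictionary is a hypothetical miniature rather than the real search output, with column names borrowed from the table.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of GridSearchCV.cv_results_ (not the real output).
cv_results = {
    "param_clf__penalty": ["l1", "l2", "elasticnet", "l2"],
    "param_feature_selection__k": [40, 40, 40, 35],
    "mean_test_score": [np.nan, 0.639165, np.nan, 0.638790],
    "mean_score_time": [0.0, 0.80, 0.0, 0.77],
}
results = pd.DataFrame(cv_results)

# Candidates whose fits failed have no test score at all.
failed = results[results["mean_test_score"].isna()]

# Count failures per penalty to see which settings never produced a model.
failure_counts = failed.groupby("param_clf__penalty").size()

# Rank only the candidates that actually trained.
ranked = (
    results.dropna(subset=["mean_test_score"])
    .sort_values("mean_test_score", ascending=False)
)

print(failure_counts)
print(ranked)
```

Passing `error_score="raise"` to `GridSearchCV` makes the first failing fit raise its exception instead of being recorded as `NaN`, which is usually the quickest way to find the cause. Two plausible causes here, neither confirmed by the output, are the default `lbfgs` solver of `LogisticRegression` rejecting the `l1` and `elasticnet` penalties, and `SelectKBest(score_func='all')` passing a string where a callable is expected.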
--------------------------------------------------------- Experiment results so far......
| | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC |
|---|---|---|---|---|---|---|---|
| 0 | Gridserach_baseline69_features | 0.9198 | 0.9194 | 0.916 | 0.6405 | 0.6447 | 0.6508 |
-----Historical Experiment Results......
| | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC |
|---|---|---|---|---|---|---|---|
| 0 | Gridserach_baseline69_features | 0.9198 | 0.9194 | 0.916 | 0.6405 | 0.6447 | 0.6508 |
CPU times: user 6min 16s, sys: 16.9 s, total: 6min 33s
Wall time: 18min 57s
gsv
GridSearchCV(cv=3,
estimator=Pipeline(steps=[('preparation',
FeatureUnion(transformer_list=[('num_pipeline',
Pipeline(steps=[('selector',
DataFrameSelector(attribute_names=['AMT_INCOME_TOTAL',
'AMT_CREDIT',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'EXT_SOURCE_1',
'EXT_SOURCE_2',
'EXT_SOURCE_3'])),
('numeric_imputer',
SimpleImputer()),
('scaling',
StandardScaler())])),
('ca...
'clf__max_depth': [10, 20],
'clf__max_features': [2, 3],
'clf__n_estimators': [100, 200],
'feature_selection': (SelectKBest(score_func='all'),)},
{'clf': (XGBClassifier(),),
'clf__learning_rate': [0.05, 0.1, 0.2],
'clf__max_depth': [1, 3, 5],
'clf__min_child_weight': [1, 3, 5],
'feature_selection': (SelectKBest(score_func='all'),)}],
return_train_score=True, scoring='roc_auc', verbose=2)
FeatureUnion(transformer_list=[('num_pipeline',
Pipeline(steps=[('selector',
DataFrameSelector(attribute_names=['AMT_INCOME_TOTAL',
'AMT_CREDIT',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'EXT_SOURCE_1',
'EXT_SOURCE_2',
'EXT_SOURCE_3'])),
('numeric_imputer',
SimpleImputer()),
('scaling',
StandardScaler())])),
('cat_pipeline',
Pipeline(steps=[('selector',
DataFrameSelector(a...
'NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS',
'NAME_HOUSING_TYPE',
'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START',
'HOUR_APPR_PROCESS_START',
'ORGANIZATION_TYPE',
'CODE_GENDER',
'FLAG_OWN_CAR',
'FLAG_OWN_REALTY'])),
('imputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('ohe',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]))])
DataFrameSelector(attribute_names=['AMT_INCOME_TOTAL', 'AMT_CREDIT',
'DAYS_EMPLOYED', 'DAYS_BIRTH',
'EXT_SOURCE_1', 'EXT_SOURCE_2',
'EXT_SOURCE_3'])
SimpleImputer()
StandardScaler()
DataFrameSelector(attribute_names=['NAME_CONTRACT_TYPE', 'NAME_TYPE_SUITE',
'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START',
'HOUR_APPR_PROCESS_START',
'ORGANIZATION_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY'])
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder(handle_unknown='ignore', sparse=False)
<__main__.FeatureSelectionstub object at 0x7f3f1faaccd0>
<__main__.Estimatorstub object at 0x7f3f1e731490>
model.cv_results_
{'mean_fit_time': array([2.48565125, 2.04430699, 2.01541948, 2.07223034, 2.08545097,
2.05409718, 2.06603622, 2.14084554, 2.01899211, 2.05836995,
2.09921026, 2.05807797, 2.6221265 , 5.65309795, 6.2104249 ,
5.61886342, 2.65001949, 7.21040146, 7.05461287, 6.90072473,
2.56293607, 6.21637893, 6.29466796, 6.23887197, 1.97854567,
2.05324443, 2.08810894, 2.07230266, 2.05167135, 2.08942087,
2.0085942 , 2.11091375, 2.08412679, 2.04243088, 2.04592641,
2.08789754, 2.12788097, 1.98443397, 1.98040708, 2.03595948,
2.06382036, 2.06303445, 2.152004 , 2.02989054, 2.12825513,
2.02467895, 2.26915439, 2.40591415, 2.59343084, 5.6968507 ,
6.11132455, 5.38031356, 2.70923066, 7.33369128, 7.22722276,
6.84870577, 2.58078893, 6.30349453, 6.24392549, 6.17335733,
1.9653132 , 2.1527857 , 2.06682412, 2.04080836, 2.06734625,
2.0372839 , 2.07390658, 2.06458386, 2.0202651 , 2.08649174,
2.05754439, 2.06650853, 1.99780798, 2.14961187, 2.04506818,
2.05879807, 2.01245697, 2.05509098, 2.02872586, 2.08407148,
2.05267501, 2.1028072 , 2.08278966, 2.19081601, 2.65248617,
5.8215464 , 6.166876 , 5.38601351, 2.6167829 , 7.07501769,
7.03581413, 6.77619855, 2.64264353, 6.41647347, 6.23255102,
6.25893013, 1.98726567, 2.10781384, 2.03614187, 2.13085254,
2.06348419, 2.08767541, 2.08743191, 2.1253713 , 2.09452343,
2.09444817, 2.05877757, 2.17759379, 1.80814266, 1.76086322,
1.81438112, 1.82868258, 1.74364495, 1.83610098, 1.76463898,
1.83053946, 1.76346087, 1.7863857 , 1.85882044, 1.86020772,
1.85260121, 1.78695377, 1.8365322 , 1.79848623, 1.8860031 ,
2.01139784, 1.83734647, 1.81639409, 1.78058791, 1.82162817,
1.85557874, 1.85625347, 1.79939795, 1.79919783, 1.81098922,
1.77616175, 1.73943233, 1.76286991, 1.83554363, 1.80648335,
1.85742156, 1.88777486, 1.73978559, 1.82660961, 1.85382215,
1.90108617, 1.82160187, 1.84062274, 1.76111094, 1.78781915,
1.80621266, 1.8357021 , 1.76546582, 1.82640417, 1.81375694,
1.87068033, 1.8560137 , 1.859308 , 1.85391498, 1.90141749,
1.75474675]),
'mean_score_time': array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.74721646, 0.79630987, 0.76937254,
0.77797413, 0.77394493, 0.79160968, 0.75707396, 0.76271248,
0.79284255, 0.76791016, 0.8022182 , 0.76702491, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.75132426, 0.77942753,
0.76788187, 0.75752894, 0.7355818 , 0.76768629, 0.77626046,
0.78296876, 0.81853763, 0.78703769, 0.76742045, 0.80130696,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.8037955 ,
0.80001799, 0.78711708, 0.7794493 , 0.76596387, 0.7986722 ,
0.79497751, 0.77321418, 0.79165093, 0.78587683, 0.83897893,
0.8855207 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]),
'mean_test_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58172802, 0.62619023, 0.62621087,
0.63791879, 0.58172835, 0.62719134, 0.62705368, 0.63878994,
0.58172893, 0.62780018, 0.62777628, 0.63916503, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58172802, 0.62619023,
0.62621087, 0.63791879, 0.58172835, 0.62719134, 0.62705368,
0.63878994, 0.58172893, 0.62780018, 0.62777628, 0.63916503,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58172802,
0.62619023, 0.62621087, 0.63791879, 0.58172835, 0.62719134,
0.62705368, 0.63878994, 0.58172893, 0.62780018, 0.62777628,
0.63916503, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'mean_train_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58171026, 0.62829272, 0.6288367 ,
0.63983935, 0.58171055, 0.62971433, 0.62963054, 0.64038654,
0.58171111, 0.63035093, 0.63020375, 0.64081258, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58171026, 0.62829272,
0.6288367 , 0.63983935, 0.58171055, 0.62971433, 0.62963054,
0.64038654, 0.58171111, 0.63035093, 0.63020375, 0.64081258,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58171026,
0.62829272, 0.6288367 , 0.63983935, 0.58171055, 0.62971433,
0.62963054, 0.64038654, 0.58171111, 0.63035093, 0.63020375,
0.64081258, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'param_clf': masked_array(data=[LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
LogisticRegression(l1_ratio=0.25),
DecisionTreeClassifier(), DecisionTreeClassifier(),
DecisionTreeClassifier(), DecisionTreeClassifier(),
DecisionTreeClassifier(), DecisionTreeClassifier(),
DecisionTreeClassifier(), DecisionTreeClassifier(),
DecisionTreeClassifier(), DecisionTreeClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
RandomForestClassifier(), RandomForestClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier(),
XGBClassifier(), XGBClassifier(), XGBClassifier()],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object),
'param_clf__bootstrap': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, True, True, True, True, True,
True, True, True, False, False, False, False, False,
False, False, False, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_clf__criterion': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, 'gini', 'gini',
'gini', 'gini', 'gini', 'entropy', 'entropy',
'entropy', 'entropy', 'entropy', --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, False, False, False, False,
False, False, False, False, False, False, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_clf__l1_ratio': masked_array(data=[0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25,
0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25,
0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25,
0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75,
0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75,
0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75,
0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75,
0.75, 0.75, 0.75, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_clf__learning_rate': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, 0.05, 0.05, 0.05, 0.05,
0.05, 0.05, 0.05, 0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object),
'param_clf__max_depth': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, 1, 3, 5, 7, 9,
1, 3, 5, 7, 9, 10, 10, 10, 10, 20, 20, 20, 20, 10, 10,
10, 10, 20, 20, 20, 20, 1, 1, 1, 3, 3, 3, 5, 5, 5, 1,
1, 1, 3, 3, 3, 5, 5, 5, 1, 1, 1, 3, 3, 3, 5, 5, 5],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object),
'param_clf__max_features': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2,
3, 3, 2, 2, 3, 3, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_clf__min_child_weight': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, 1, 3, 5, 1, 3, 5, 1, 3,
5, 1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3,
5],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object),
'param_clf__n_estimators': masked_array(data=[--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, 100, 200, 100, 200, 100, 200,
100, 200, 100, 200, 100, 200, 100, 200, 100, 200, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --],
mask=[ True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_clf__penalty': masked_array(data=['l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1',
'l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2',
'l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'l1', 'l1',
'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1',
'l1', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2',
'l2', 'l2', 'l2', 'l2', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'l1', 'l1', 'l1', 'l1',
'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l1', 'l2',
'l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2',
'l2', 'l2', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', 'elasticnet', 'elasticnet', 'elasticnet',
'elasticnet', --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_feature_selection': masked_array(data=[SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all'),
SelectKBest(score_func='all')],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object),
'param_feature_selection__k': masked_array(data=[30, 30, 30, 30, 35, 35, 35, 35, 40, 40, 40, 40, 30, 30,
30, 30, 35, 35, 35, 35, 40, 40, 40, 40, 30, 30, 30, 30,
35, 35, 35, 35, 40, 40, 40, 40, 30, 30, 30, 30, 35, 35,
35, 35, 40, 40, 40, 40, 30, 30, 30, 30, 35, 35, 35, 35,
40, 40, 40, 40, 30, 30, 30, 30, 35, 35, 35, 35, 40, 40,
40, 40, 30, 30, 30, 30, 35, 35, 35, 35, 40, 40, 40, 40,
30, 30, 30, 30, 35, 35, 35, 35, 40, 40, 40, 40, 30, 30,
30, 30, 35, 35, 35, 35, 40, 40, 40, 40, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_feature_selection__score_func': masked_array(data=[<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>,
<function r_regression at 0x7f3f1fdfd290>, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'param_preparation__num_pipeline__scaling': masked_array(data=['passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), 'passthrough',
Normalizer(norm='max'), Normalizer(), StandardScaler(),
'passthrough', Normalizer(norm='max'), Normalizer(),
StandardScaler(), --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --, --, --, --, --, --, --, --, --, --, --, --, --,
--, --],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True,
True],
fill_value='?',
dtype=object),
'params': [{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.25,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.5,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l1',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'l2',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 30,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 35,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': 'passthrough'},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer(norm='max')},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': Normalizer()},
{'clf': LogisticRegression(l1_ratio=0.25),
'clf__l1_ratio': 0.75,
'clf__penalty': 'elasticnet',
'feature_selection': SelectKBest(k=40, score_func=<function r_regression at 0x7f3f1fdfd290>),
'feature_selection__k': 40,
'feature_selection__score_func': <function sklearn.feature_selection._univariate_selection.r_regression>,
'preparation__num_pipeline__scaling': StandardScaler()},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'gini',
'clf__max_depth': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'gini',
'clf__max_depth': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'gini',
'clf__max_depth': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'gini',
'clf__max_depth': 7,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'gini',
'clf__max_depth': 9,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'entropy',
'clf__max_depth': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'entropy',
'clf__max_depth': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'entropy',
'clf__max_depth': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'entropy',
'clf__max_depth': 7,
'feature_selection': SelectKBest(score_func='all')},
{'clf': DecisionTreeClassifier(),
'clf__criterion': 'entropy',
'clf__max_depth': 9,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 10,
'clf__max_features': 2,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 10,
'clf__max_features': 2,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 10,
'clf__max_features': 3,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 10,
'clf__max_features': 3,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 20,
'clf__max_features': 2,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 20,
'clf__max_features': 2,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 20,
'clf__max_features': 3,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': True,
'clf__max_depth': 20,
'clf__max_features': 3,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 10,
'clf__max_features': 2,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 10,
'clf__max_features': 2,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 10,
'clf__max_features': 3,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 10,
'clf__max_features': 3,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 20,
'clf__max_features': 2,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 20,
'clf__max_features': 2,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 20,
'clf__max_features': 3,
'clf__n_estimators': 100,
'feature_selection': SelectKBest(score_func='all')},
{'clf': RandomForestClassifier(),
'clf__bootstrap': False,
'clf__max_depth': 20,
'clf__max_features': 3,
'clf__n_estimators': 200,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 1,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 1,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 1,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 3,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 3,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 3,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 5,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 5,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.05,
'clf__max_depth': 5,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 1,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 1,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 1,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 3,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 3,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 3,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 5,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 5,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.1,
'clf__max_depth': 5,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 1,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 1,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 1,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 3,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 3,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 3,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 5,
'clf__min_child_weight': 1,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 5,
'clf__min_child_weight': 3,
'feature_selection': SelectKBest(score_func='all')},
{'clf': XGBClassifier(),
'clf__learning_rate': 0.2,
'clf__max_depth': 5,
'clf__min_child_weight': 5,
'feature_selection': SelectKBest(score_func='all')}],
'rank_test_score': array([ 61, 127, 126, 125, 124, 123, 122, 121, 120, 119, 118, 117, 34,
25, 22, 7, 31, 16, 19, 4, 28, 10, 13, 1, 116, 115,
114, 113, 112, 111, 110, 109, 108, 107, 106, 105, 104, 103, 102,
101, 128, 129, 130, 131, 159, 158, 157, 156, 34, 25, 22, 7,
31, 16, 19, 4, 28, 10, 13, 1, 155, 154, 153, 152, 151,
150, 149, 148, 147, 160, 146, 144, 143, 142, 141, 140, 139, 138,
137, 136, 99, 134, 133, 132, 34, 25, 22, 7, 31, 16, 19,
4, 28, 10, 13, 1, 100, 145, 98, 79, 38, 39, 40, 41,
60, 42, 43, 44, 59, 37, 58, 57, 56, 55, 54, 53, 52,
51, 50, 49, 48, 47, 46, 45, 62, 82, 83, 84, 85, 86,
87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 81, 97, 80,
78, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
75, 76, 77, 135, 161], dtype=int32),
'split0_test_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.57959472, 0.62770447, 0.6269332 ,
0.63953327, 0.57959572, 0.6287761 , 0.62808024, 0.64038537,
0.57959571, 0.6294235 , 0.62877515, 0.64056361, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.57959472, 0.62770447,
0.6269332 , 0.63953327, 0.57959572, 0.6287761 , 0.62808024,
0.64038537, 0.57959571, 0.6294235 , 0.62877515, 0.64056361,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.57959472,
0.62770447, 0.6269332 , 0.63953327, 0.57959572, 0.6287761 ,
0.62808024, 0.64038537, 0.57959571, 0.6294235 , 0.62877515,
0.64056361, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'split0_train_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58279146, 0.62771414, 0.62683807,
0.63916889, 0.58279234, 0.62886122, 0.6279539 , 0.63978914,
0.58279233, 0.62931876, 0.62843064, 0.64013863, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58279146, 0.62771414,
0.62683807, 0.63916889, 0.58279234, 0.62886122, 0.6279539 ,
0.63978914, 0.58279233, 0.62931876, 0.62843064, 0.64013863,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58279146,
0.62771414, 0.62683807, 0.63916889, 0.58279234, 0.62886122,
0.6279539 , 0.63978914, 0.58279233, 0.62931876, 0.62843064,
0.64013863, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'split1_test_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58089982, 0.62269531, 0.62261375,
0.63387293, 0.58089982, 0.62374966, 0.62322188, 0.6347511 ,
0.58090071, 0.62471553, 0.62415198, 0.63574966, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58089982, 0.62269531,
0.62261375, 0.63387293, 0.58089982, 0.62374966, 0.62322188,
0.6347511 , 0.58090071, 0.62471553, 0.62415198, 0.63574966,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58089982,
0.62269531, 0.62261375, 0.63387293, 0.58089982, 0.62374966,
0.62322188, 0.6347511 , 0.58090071, 0.62471553, 0.62415198,
0.63574966, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'split1_train_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58211045, 0.62998051, 0.63155283,
0.64161768, 0.58211044, 0.63229146, 0.63216543, 0.64201114,
0.58211122, 0.63277598, 0.63259489, 0.642636 , nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58211045, 0.62998051,
0.63155283, 0.64161768, 0.58211044, 0.63229146, 0.63216543,
0.64201114, 0.58211122, 0.63277598, 0.63259489, 0.642636 ,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58211045,
0.62998051, 0.63155283, 0.64161768, 0.58211044, 0.63229146,
0.63216543, 0.64201114, 0.58211122, 0.63277598, 0.63259489,
0.642636 , nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'split2_test_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58468952, 0.62817092, 0.62908568,
0.64035017, 0.5846895 , 0.62904825, 0.62985892, 0.64123335,
0.58469038, 0.62926152, 0.63040169, 0.64118183, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58468952, 0.62817092,
0.62908568, 0.64035017, 0.5846895 , 0.62904825, 0.62985892,
0.64123335, 0.58469038, 0.62926152, 0.63040169, 0.64118183,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58468952,
0.62817092, 0.62908568, 0.64035017, 0.5846895 , 0.62904825,
0.62985892, 0.64123335, 0.58469038, 0.62926152, 0.63040169,
0.64118183, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'split2_train_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.58022888, 0.62718351, 0.62811921,
0.63873148, 0.58022888, 0.62799032, 0.62877228, 0.63935936,
0.58022978, 0.62895806, 0.62958573, 0.6396631 , nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.58022888, 0.62718351,
0.62811921, 0.63873148, 0.58022888, 0.62799032, 0.62877228,
0.63935936, 0.58022978, 0.62895806, 0.62958573, 0.6396631 ,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.58022888,
0.62718351, 0.62811921, 0.63873148, 0.58022888, 0.62799032,
0.62877228, 0.63935936, 0.58022978, 0.62895806, 0.62958573,
0.6396631 , nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'std_fit_time': array([0.29530645, 0.04504689, 0.08245922, 0.08315989, 0.06974903,
0.06575012, 0.04169157, 0.1079135 , 0.08632884, 0.04968825,
0.10081121, 0.0298441 , 0.12205869, 0.49471439, 0.28212413,
0.58786996, 0.12570799, 0.06086678, 0.17896106, 0.14666394,
0.05234892, 0.09084173, 0.13874006, 0.08055511, 0.06267692,
0.05576733, 0.03708739, 0.02424204, 0.03002727, 0.08901569,
0.07508639, 0.10522634, 0.06735109, 0.02823245, 0.06121629,
0.03698799, 0.05215017, 0.03811244, 0.05877044, 0.04860605,
0.05915721, 0.0581589 , 0.04538493, 0.06252936, 0.12065034,
0.07044545, 0.31461266, 0.27400849, 0.09456752, 0.48479909,
0.30224046, 0.73990313, 0.12024338, 0.10043807, 0.14719728,
0.16641984, 0.03749081, 0.04425184, 0.08026995, 0.13995506,
0.02801981, 0.04875308, 0.03459903, 0.07034049, 0.04762302,
0.07193453, 0.08400056, 0.06865469, 0.0273184 , 0.0690145 ,
0.03775067, 0.03212959, 0.06934699, 0.06656384, 0.02486417,
0.05952854, 0.06195668, 0.11091234, 0.02649336, 0.05823417,
0.00826529, 0.08215148, 0.02530785, 0.10171463, 0.16048854,
0.46047184, 0.28161107, 0.76761324, 0.09167254, 0.10278388,
0.09438553, 0.0250432 , 0.08685173, 0.08643547, 0.13811884,
0.17860215, 0.06093327, 0.10956375, 0.09563911, 0.05773718,
0.05837719, 0.05522838, 0.01314305, 0.03263509, 0.0352804 ,
0.05943876, 0.02404937, 0.06871763, 0.06520987, 0.04450078,
0.08638678, 0.05185144, 0.0465879 , 0.02032546, 0.06077664,
0.05837608, 0.05102838, 0.0724219 , 0.03635644, 0.04718437,
0.11307445, 0.04768278, 0.07925932, 0.06198392, 0.0349719 ,
0.12501599, 0.12914944, 0.03796778, 0.06458723, 0.04673831,
0.14029118, 0.04973648, 0.050416 , 0.07042412, 0.03139129,
0.10828824, 0.06308424, 0.0607253 , 0.08088295, 0.03120641,
0.02435003, 0.09674837, 0.02581149, 0.04845691, 0.03456497,
0.07629298, 0.00681167, 0.07121162, 0.0349915 , 0.08297136,
0.12002634, 0.03839719, 0.0374538 , 0.04921005, 0.02740809,
0.0371483 , 0.04856494, 0.06092083, 0.003113 , 0.0386104 ,
0.11679931]),
'std_score_time': array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.02635576, 0.02705131, 0.0340496 ,
0.03397194, 0.04168647, 0.03077361, 0.02478231, 0.02918765,
0.02192913, 0.00159951, 0.02067435, 0.01835365, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.02721578, 0.0293674 ,
0.02231051, 0.02726467, 0.01148289, 0.0208939 , 0.00544307,
0.01857212, 0.01285782, 0.01502459, 0.00416773, 0.04906935,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.06663886,
0.00595994, 0.02514633, 0.0197334 , 0.0196708 , 0.020582 ,
0.03380309, 0.01589579, 0.0314206 , 0.03602053, 0.05009879,
0.0643557 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]),
'std_test_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.00216082, 0.00247861, 0.00269107,
0.00288023, 0.00216048, 0.00243617, 0.00280511, 0.0028768 ,
0.00216077, 0.00218218, 0.00264739, 0.00242818, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.00216082, 0.00247861,
0.00269107, 0.00288023, 0.00216048, 0.00243617, 0.00280511,
0.0028768 , 0.00216077, 0.00218218, 0.00264739, 0.00242818,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.00216082,
0.00247861, 0.00269107, 0.00288023, 0.00216048, 0.00243617,
0.00280511, 0.0028768 , 0.00216077, 0.00218218, 0.00264739,
0.00242818, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan]),
'std_train_score': array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.00108376, 0.00121295, 0.00199053,
0.00127009, 0.00108406, 0.00185666, 0.00182331, 0.00116208,
0.00108374, 0.00172108, 0.00175532, 0.00130389, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, 0.00108376, 0.00121295,
0.00199053, 0.00127009, 0.00108406, 0.00185666, 0.00182331,
0.00116208, 0.00108374, 0.00172108, 0.00175532, 0.00130389,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.00108376,
0.00121295, 0.00199053, 0.00127009, 0.00108406, 0.00185666,
0.00182331, 0.00116208, 0.00108374, 0.00172108, 0.00175532,
0.00130389, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan])}
Random forest
Decision Tree Classifier
XGBoost
Resampling
Random Forest after resampling
Decision tree after resampling
XGBoost after resampling
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. Computing the area under the ROC curve summarizes the curve in a single number.
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
from sklearn.metrics import roc_auc_score, roc_curve
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
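To make the "area under the curve" idea concrete, the following minimal sketch rebuilds the AUC in two other ways on the same toy values as the scikit-learn example: by integrating the ROC curve returned by roc_curve, and by counting how often a positive example is scored above a negative one. The variable names are illustrative and not part of the project code.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores (illustrative values, not HCDR data).
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns false/true positive rates at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Integrating TPR over FPR with the trapezoidal rule gives the AUC.
area = auc(fpr, tpr)

# Equivalent ranking view: the fraction of (positive, negative) pairs
# in which the positive example receives the higher score.
pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(area, pairwise, roc_auc_score(y_true, y_scores))
```

The pairwise view is why AUC depends only on the ordering of the predicted probabilities, not on their absolute values (this simple version assumes no tied scores).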
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
test_class_scores = model.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()  # copy so adding TARGET does not raise SettingWithCopyWarning
submit_df['TARGET'] = test_class_scores
submit_df.head()
submit_df.to_csv("submission.csv",index=False)
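Before uploading, it can help to sanity-check the frame against the required format. The helper below is a hypothetical sketch (check_submission is not part of the notebook or of the Kaggle API); it assumes the two-column layout described above and that TARGET holds probabilities.

```python
import pandas as pd


def check_submission(df: pd.DataFrame) -> None:
    """Raise ValueError if df does not match the expected submission layout."""
    if list(df.columns) != ["SK_ID_CURR", "TARGET"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if df["SK_ID_CURR"].duplicated().any():
        raise ValueError("duplicate SK_ID_CURR values")
    if df["TARGET"].isna().any():
        raise ValueError("missing TARGET values")
    if not df["TARGET"].between(0.0, 1.0).all():
        raise ValueError("TARGET must be a probability in [0, 1]")


# Toy frame mirroring the documented format (illustrative values only).
example = pd.DataFrame(
    {"SK_ID_CURR": [100001, 100005, 100013], "TARGET": [0.1, 0.9, 0.2]}
)
check_submission(example)  # passes silently
```

In the notebook, the same call could be applied to submit_df before writing submission.csv.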
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission"
Model Evaluation
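from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hyperparameter grid for the decision tree step of the pipeline
params = {
    "tree__criterion": ['gini', 'entropy'],
    "tree__max_depth": range(1, 11, 2)
}
# Preprocessing followed by the classifier
dtc = Pipeline([
    ("preparation", data_prep_pipeline),
    ("tree", DecisionTreeClassifier())
])
# Grid search scored on F1 (the Kaggle leaderboard itself uses ROC AUC)
treeGrid = GridSearchCV(dtc, param_grid=params, scoring='f1', return_train_score=True)
treeGrid.fit(X_train, y_train)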
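The raw cv_results_ dictionary is hard to read when printed directly. A minimal sketch of a more readable summary follows. It uses synthetic data as a stand-in for X_train and y_train, omits the preparation step, and scores on ROC AUC instead of F1; all three are assumptions made for illustration. With the default error_score, parameter combinations whose fit fails are recorded as NaN, so counting NaN scores is a quick way to spot broken configurations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption: not the HCDR tables).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([("tree", DecisionTreeClassifier(random_state=0))])
params = {
    "tree__criterion": ["gini", "entropy"],
    "tree__max_depth": list(range(1, 11, 2)),
}

grid = GridSearchCV(
    pipe, param_grid=params, scoring="roc_auc", cv=3, return_train_score=True
)
grid.fit(X, y)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid.cv_results_)
cols = [
    "param_tree__criterion",
    "param_tree__max_depth",
    "mean_train_score",
    "mean_test_score",
    "rank_test_score",
]
summary = results[cols].sort_values("rank_test_score")
print(summary.head())

# Combinations whose fit failed show up as NaN test scores.
n_failed = int(results["mean_test_score"].isna().sum())
print("failed fits:", n_failed)
```

Passing error_score="raise" to GridSearchCV makes a failing fit raise its exception immediately, which is usually the fastest way to find out why a block of scores is NaN.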
For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:
In Phase 1 of our HCDR project we created a baseline model that was not accurate enough for prediction, so in Phase 2 we improved it using several techniques. In this phase we focused on feature engineering and hyperparameter tuning, and also worked on feature selection, feature importance analysis, and ensemble methods. First, we aggregated the data by creating pipelines for all the secondary tables and merged the aggregated data into the main table through a pipeline; this process includes imputation, scaling, and normalization. A class-based feature transformer is used for feature transformation, and FeatureUnion combines num_pipeline and cat_pipeline. We ran a series of experiments to find the most important features. Finally, we tuned the hyperparameters of our models with grid search. The models used in this phase are decision tree, random forest, and XGBoost; the best model is determined by grid search over each model's parameters.
In Phase 1 of our project we performed exploratory data analysis, one-hot encoded all the categorical features for feature engineering, and built a baseline pipeline using logistic regression. The baseline reached an accuracy of 91.59% on the held-out test set and an AUC of 0.7356, with a training time of 35.7 s. As our workflow below shows, we concentrated on feature engineering and hyperparameter tuning in Phase 2. Aside from that, we focused on feature selection, feature importance analysis, and ensemble approaches. To begin, we created pipelines for all of the secondary tables to aggregate data. A pipeline combines the aggregated data into the main table; this procedure includes imputation, scaling, and normalization. For feature transformation, a class-based feature transformer is employed. In the class-based feature transformation we included the following tables for feature engineering.
For hyperparameter tuning we used grid search with cross-validation (GridSearchCV). We then trained decision tree, random forest, and XGBoost models and ran the grid search over their parameters to determine the best model.
We used class-based feature engineering, implemented with a class-based feature transformer. Features from the bureau, bureau balance, application, and credit card balance tables are used.
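As an illustration of what a class-based transformer for a secondary table can look like, here is a minimal sketch. GroupAggregator is a hypothetical class written for this example and is not the notebook's FeaturesAggregater; the toy frame only imitates the shape of a secondary table keyed by SK_ID_CURR.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupAggregator(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: aggregate numeric columns per key value."""

    def __init__(self, key="SK_ID_CURR", features=None, funcs=("min", "max", "mean")):
        self.key = key
        self.features = features
        self.funcs = funcs

    def fit(self, X, y=None):
        # Nothing to learn; aggregation is stateless.
        return self

    def transform(self, X):
        # Default to every numeric column except the key.
        features = self.features or [
            c for c in X.select_dtypes("number").columns if c != self.key
        ]
        agg = X.groupby(self.key)[features].agg(list(self.funcs))
        # Flatten the (column, function) MultiIndex into names like "AMT_min".
        agg.columns = [f"{col}_{func}" for col, func in agg.columns]
        return agg.reset_index()


# Toy secondary table: two rows for client 1, one row for client 2.
bureau_toy = pd.DataFrame(
    {"SK_ID_CURR": [1, 1, 2], "AMT_CREDIT_SUM": [100.0, 300.0, 50.0]}
)
out = GroupAggregator(features=["AMT_CREDIT_SUM"]).fit_transform(bureau_toy)
print(out)
```

The result has one row per SK_ID_CURR, so it can be merged into the main application table on that key.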
Parameter tuning is done using grid search. We obtained the following results from our hyperparameter tuning.
Pipeline for aggregation of secondary tables:
bureau_feature_pipeline = Pipeline([
("bureau_new_features", BureauFeaturesAgg()),
(
'feature_aggregater',
FeaturesAggregater(bur, bureau_features, "SK_ID_CURR", True)
),
])
bb_feature_pipeline = Pipeline([
("bureau_balance_new_features", BureauBalanceFeaturesAgg(bur))
])
prevApps_feature_pipeline = Pipeline([
(
'prevApps_aggregater',
FeaturesAggregater(pa, pa_features, "SK_ID_CURR", False)
),
])
ccb_feature_pipeline = Pipeline([
('credit_card_balance_new_features', CreditCardBalanceFeaturesAgg()),
(
'feature_aggregater',
FeaturesAggregater(ccb, ccb_features, "SK_ID_CURR", True)
),
])
pcb_feature_pipeline = Pipeline([
('POS_cash_balance_new_features', POSCashBalanceFeaturesAgg()),
])
Pipeline for random forest:
Pipeline([
("preparation", data_prep_pipeline),
("rmf", RandomForestClassifier())
])
Pipeline for decision tree:
dtc = Pipeline([
("preparation", data_prep_pipeline),
("tree", DecisionTreeClassifier())
])
Pipeline for XGBoost:
xgbPipe = Pipeline([
("preparation", data_prep_pipeline),
("xgb", xgboost.XGBClassifier())
])
The following results were obtained from the respective experiments.
We performed feature engineering and hyperparameter tuning in this phase. We also tried to improve our results using decision tree, random forest, and XGBoost models. Our best model is XGBoost, with a test accuracy of 0.8771; after resampling, the XGBoost test accuracy is 0.8132. In Phase 3 we plan to develop the following:
Read the following: